Anal. Chem. 2006, 78, 4598-4608
Automatic Selection of Optimal Savitzky-Golay Smoothing Gabriel Vivo´-Truyols* and Peter J. Schoenmakers
Polymer-Analysis Group, van’t Hoff Institute for Molecular Sciences, University of Amsterdam, Nieuwe Achtergracht 166, 1018 WV-Amsterdam, The Netherlands
A method to select the optimal window size of the Savitzky-Golay (SG) algorithm is presented. The approach is based on a comparison of the fitting residuals (i.e., the differences between the input signal and the smoothed signal) with the noise of the instrument. The window size that yields an autocorrelation of the residuals closest to the autocorrelation of the noise of the instrument is considered optimal. The method is applied in two steps. In a first step, the lag-one autocorrelation value of the noise of the instrument is computed through the study of a blank signal. In a second step, the SG algorithm is applied to “smooth” the signal using different window sizes. The method was applied to data from NMR, chromatography, and mass spectrometry and was shown to be robust. It finds the optimal window size for different signal features. This allows the method to be used in an unsupervised way, embedded in a more complex algorithm in which smoothing and/or differentiation of signals is required, provided that the lag-one autocorrelation value of the instrument noise does not change. The current trend is for the amounts of data produced in a laboratory to increase dramatically. Complex analytical instruments are increasingly abundant (e.g., comprehensive twodimensional chromatography, chromatography hyphenated with tandem mass spectrometry), which causes an exponential increase in the size of data sets produced. As a consequence, an urgent demand for data analysis methods has emerged to convert these enormous quantities of data to information (and, moreover, to knowledge). It is argued that in most cases the data treatment step now constitutes a major bottleneck in the analytical process. From the perspective of a computer programmer, the required complex statistical tools are normally built following the “divide and conquer” rule: the entire problem is divided in a number of much simpler sub-problems. Simple routines are assembled in such a way that the whole program is able to perform a complex task. In this context, the development of unsupervised methods is crucial. When a complex program is built, asking for user intervention in each of these simple routines would constitute a great disadvantage. For any data treatment program that consists of more than a few steps, this would be totally impractical. Therefore, it is an absolute necessity that fully automatic building blocks are developed for essential tasks in analytical data analysis. * Corresponding author. E-mail: +31 205256576. Fax: +31 205255604.
[email protected].
4598 Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
Phone:
Since Savitzky and Golay published their work in 1967,1 their simple smoother has become an almost universal method to improve the signal-to-noise ratio of any kind of signal. Indeed, the Savitzky-Golay smoother (SG) has been applied not only within analytical chemistry but also in other disciplines, including satellite data analysis,2 forensic science,3 or fission activity in nuclear reactors.4 Other methods exist that can compete with the SG algorithm, such as the Whittaker smoother,5 Kalman filtering,6 or Fourier transform filters. Despite this competition, the success of the SG smoother is beyond doubt. It is due to two features (i.e., simplicity and low (computer) costs). A simple local leastsquares regression is performed, the computation time of which is normally insignificant (even on an average PC) since the regression operation (the inversion of the matrix design) is performed only once in advance. Another important advantage of the SG algorithm is that it provides the derivatives of the input signal without additional computing costs.7 This benefit becomes important if one considers that the numerical computation of the higher order derivative gives rise to a noise increment. For that reason, the computation of signal derivatives is normally linked to a smoothing technique. This makes the SG smoothing especially attractive since both the smoothed signal and the derivatives can be calculated in a single step. Basically, smoothing consists of the elimination of the noise from the signal with the lowest possible signal distortion. Since the smoothing function will not be an exact description of the noise-free signal, signal distortion will be always present. A too “flexible” smoothing leaves too much noise in the signal, whereas a too “rigid” one will yield signals in which some informative features are lost. Therefore, some kind of compromise must be found. Finding such a balance can be rather difficult (and, in certain cases, almost impossible), depending on both the features of the signal and on the noise. Moreover, the optimal smoothing depends on features that can change from sample to sample (e.g., sampling frequency of the instrument, noise variance, or the peak(1) Savitzky, A.; Golay, M. J. E. Anal. Chem. 1964, 36, 1627-1639. (2) Jo ¨nsson, P.; Eklundh, L. Comput. Geosci. 2004, 30, 833-845. (3) Schneider, R. C.; Kovar, K. A. Forensic Sci. Int. 2003, 134, 187-195. (4) Harnden-Gillis, A. C.; Bennett, L. G. I.; Lewis, B. J. Nucl. Instrum. Methods Phys. Res., Sect. A 1994, 345, 520-527. (5) Eilers, P. H. C. Anal. Chem. 2003, 74, 3631-3636. (6) Fahrmeir, L.; Wagenpfeil, S. Comput. Stat. Data Anal. 1997, 24, 295-320. (7) Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemomentrics and Qualimetrics, Part A; Elsevier: Amsterdam, 1998; Chapter 20. 10.1021/ac0600196 CCC: $33.50
© 2006 American Chemical Society Published on Web 05/23/2006
shape of the signal8) or even within the same experiment (e.g., with time or with wavelength). In the SG algorithm, the smoothing can be controlled with two parameters, i.e., the window size and the polynomial degree. Small window sizes and high polynomial degrees yield noisy signals (“flexible smoothing”), whereas large window sizes and low polynomial degrees may yield distorted signals (“rigid smoothing”). Between the two parameters, the window size allows a better fine-tuning of the smoothing. Commonly, this is the only parameter considered to find the optimal smoothing. For that reason, we decided to keep the polynomial degree constant, optimizing the process by varying only the window size. However, the same principles are applicable when optimizing the polynomial degree. It is hard to find statistical treatments aimed at optimizing the window size in SG smoothing in the literature. Enke and Nieman9 studied the signal-to-noise-ratio enhancement obtained with the SG algorithm. However, in their approach the signal features (e.g. peak width) should be known in advance. This jeopardizes the applicability of the method in an unsupervised fashion since, as mentioned above, the features can change from sample to sample and, moreover, within the same experiment. Another approach in which not only the window size but also the polynomial degree was optimized was developed by Jakubowska and Kubiak.10 Their method was based on checking the significance of the improved fitting when the polynomial degree was increased. In most cases, the optimization of the smoothing is, however, performed in a subjective and interactive manner. Different window sizes and polynomial degrees are applied until a visually pleasing smoothed signal is obtained. As explained before, a subjective and interactive method to select the optimal smoothness cannot be used in situations in which the smoothing (or the computation of derivatives) constitutes one step within a complex program. Fully automated methods are absolutely necessary in all situations in which the features of the signal change, so that the window size for optimal smoothing must be adapted. This includes situations in which the derivative of a signal is needed, such as in the deconvolution of chromatograms11 or the search for spectral differences in data sets.12 In this work, an automated method is presented to optimize the performance of the SG algorithm through the selection of the window size. The algorithm finds the optimal window size for different signal features, noise levels, and sampling frequencies without user intervention. The only parameter that must be known in advance is the auto-correlation function of the noise of the detector. An experimental design was carried out with simulated data to investigate the robustness of the proposed algorithm. The method was also tested on three different real examples: NMR spectra, liquid chromatography, and mass spectra. The proposed approach automatically found the optimal smoothing conditions in all these situations. (8) Karjalainen, E. J.; Karjalainen, U. P. Data Analysis for Hyphenated Techniques; Elsevier: Amsterdam, 1996. (9) Enke, C. G.; Nieman, T. A. Anal. Chem. 1976, 48, 705A-712A. (10) Jakubowska, M.; Kubiak, W. W. Anal. Chim. Acta 2004, 512, 241-250. (11) Torabi, K.; Karami, A.; Balke, S. T.; Schunk, T. C. J. Chromatogr. A 2001, 910, 19-30. (12) Woo, Y.; Kim, H.; Ze, K.; Chung, H. J. Pharm. Biomed. Anal. 2005, 36, 955-959.
In previous work, a similar idea was proposed to optimize the SG smoothing of chromatographic signals.13 The principle was, however, more simple since the correlation of the noise of the instrument was not taken into account (it was assumed that the instrument yielded only uncorrelated noise). Although this can be a good approximation, the previous method was imperfect. THEORETICAL BASIS Noise Correlation. Let us assume a noise vector a ) (a1, a2, ..., ai, ..., an), which contains only a random variation with standard deviation σa and mean zero, where i ) (1, 2, ..., n) denotes the discrete time increment in units of the sampling interval, ∆t. From basic statistics, it can be demonstrated that a new vector b ) (a2 - a1, a3 - a2, ..., an - an-1), obtained by subtracting each element from the next one, also exhibits a mean-centered random variation, the variance of which is approximately double that of a (and, consequently, the standard deviation of the new population is σb ≈ σax2). Formally, the following relationship applies n
∑ (a - a i
i-1)
2
(n)
i)2
)
(n - 1)
n
∑ (a )
2
(σb)2 (σa)2
≈2
(1)
i
i)1
Note that eq 1 is similar to that used in the Durbin-Watson test14 to check for underfitting or overfitting by regression models. In their test, Durbin and Watson defined the DW statistic: n
∑ (a - a i
DW )
2 i-1)
i)2
(2)
n
∑ (a )
2
i
i)1
which is equivalent to the left-hand part of eq 1. The correction n/(n - 1) is added to compensate for the difference in size of vectors a and b. This correction becomes nonsignificant for large vectors (n > 100). The DW value approaches 2 when a contains uncorrelated residuals.15 A regression model does not show overfitting or underfitting when its residuals are uncorrelated. Conventionally, noise that contains only random (uncorrelated) variation is called white noise. When the variation is correlated, one speaks about colored noise. The color of the noise can be measured using the autocorrelation function: n
∑
Fh ) 1 -
1 i)1+h 2
(ei - ei-h)2 n
∑ (e )
2
(n) (3)
(n - h)
i
i)1
where h is the lag of the correlation, that is, the distance (number (13) Durbin, J.; Watson, G. S. Biometrika 1950, 37, 409-428. (14) Rutledge, D. N.; Barros, A. S. Anal. Chim. Acta 2002, 454, 277-295. (15) Mann, M. E.; Lees, J. M.; Clim. Change 1996, 33, 409-445.
Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
4599
of points) between the data points considered. In this equation we use the vector e (which contains correlated noise) to distinguish it from the vector a in eq 1, which contains only random noise. The lag-one autocorrelation coefficient, F1, defines the color of the noise. We speak of “red” noise when F1 is positive, “blue” noise when it is negative, and “white” noise when it is 0. It follows that, when F1 ) 0 (white noise), then e ) a and eq 1 can be derived from eq 3. One should note that eq 3 is not identical to the classical definition of the autocorrelation function given in ref 7. We prefer to use eq 3 instead of the definition given in ref 7 because the connection between this definition and the Durbin-Watson test is clearer. The equivalence of both definitions is demonstrated in the Appendix. A plot of the values of Fh versus h is called an autocorrelogram. This constitutes a method to characterize the noise of an instrument. Given a blank signal, a series of autocorrelation coefficients F ) (F1, F2, ..., Fk) can be calculated and plotted versus h (or versus the corresponding time units if the sampling interval ∆t is known). The simplest model for correlated (colored) noise is the firstorder autoregressive Markov process:16
ei ) F1ei-1 + ai
(4)
where e1, e1, ..., em are elements of the colored-noise vector e, F1 (-1 < F1 < 1) is the lag-one autocorrelation coefficient, and a1, a1, ..., am are elements of the white-noise vector a with standard deviation σa (as defined before). When eq 4 holds, the autocorrelogram has an exponential form:
Fh ) exp
(-h∆t τ )
(5)
where ∆t is the sampling interval, and τ is the characteristic decay time, defined by
τ)
-∆t log F1
(6)
From eqs 5 and 6, it appears that a first-order autoregressive process can be characterized with only one parameter (either τ or F1). The total mean standard deviation of the error e is16
S0 )
x
σa2
1 - F12
(7)
One should note that eqs 4-7 hold only for a first-order autoregressive process. Savitzky-Golay Algorithm. The Savitzky-Golay algorithm is well-known, and it will not be explained in detail. Only some terminology will be defined here. More information can be found (16) Vandeginste, B. G. M.; Massart, D. L.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics, Part B; Elsevier: Amsterdam, 1998; Chapter 40.
4600
Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
in the literature.17 Suppose that yexp is a (noisy) signal of m observations sampled at equal distances. We can assume that yexp contains two sources of variation, namely, the theoretical signal (yˆ) and the noise (e):
yexp ) yˆ + e
(8)
The principle of the SG algorithm is the following: (i) a data interval (i.e., window size) is selected, (ii) a low-order polynomial function is fitted to the data within this interval, and (iii) the experimental point is replaced by the polynomially predicted one at the center of the interval. The process is repeated after shifting the window ahead by one sampling interval. This results in a smoothed signal, ysmd, the length of which, m′, is
m′ ) m - w + 1
(9)
where w is the window size. We define yexp′ as the cropped vector obtained from eliminating the values at the extremes of yexp in order to match the dimensions of ysmd and yexp. To keep at least one degree of freedom in the regression, there is a minimum window size (w) for each polynomial degree, p (i.e., w g p + 1). Also, w is always an odd number. As mentioned, the extent of smoothing is controlled by the values of w and p. However, we can exert more control on the smoothing by changing w instead of p. For that reason, we decided to keep the value of p constant and to control the extent of smoothing by changing only the window size. A value of p ) 2 will be used throughout this work. However, the approach is equally applicable for different p values. As mentioned, the SG algorithm provides not only the smoothed signal, ysmd, but also the derivatives of yexp. Instead of replacing the central point in the window by the value of the polynomial (to obtain ysmd), one may replace it by the value of the first (or nth order) derivative of the same polynomial at the same point. To avoid confusion, we will define yd1 as the firstorder derivative and, in general, ydn as the nth-order derivative. The dimensions of ydn are the same than as those of ysmd (and yexp′). It follows that the maximum order of derivative that can be obtained is equal to p. Selection of the Proper Window Size. The main goal of any smoothing procedure is to retrieve yˆ (eq 8) with as much fidelity as possible. In mathematical terms, this means finding the window size (w) that makes ysmd as similar as possible to yˆ ′ (where the symbol ′ means, as before, that the vector has been cropped to have the same dimension as ysmd). This is equivalent to finding the w value that makes the residuals of the smoothing, yres ) yexp′ - ysmd, as close as possible to e′. The similarity between yres and e′ can be expressed through the autocorrelation functions of both e′ and yres. Formally speaking, the method of finding the optimal window size involves two steps. In a first step, the autocorrelogram of the noise of a blank signal (e) is calculated as outlined earlier, and the lag-one autocorrelation coefficient (F1,e) is calculated for the (17) Vivo´-Truyols, G.; Torres-Lapasio´, J. R.; van Nederkassel, A. M.; Vander Heyden, Y.; Massart, D. L. J. Chromatogr. A 2005, 1096, 133-145.
noise (subindex e). This step can be performed once the instrument settings are fixed, as long as the noise characteristics do not vary. In a second step, the SG algorithm is applied a number of times to smooth the signal, using a different window size each time. The lag-one autocorrelation coefficient (F1,res) is monitored (the subindex res indicates that F1 is calculated for yres). The window size yielding a value of F1,res closest to F1,e is considered to be optimal. We want to stress the link between the Durbin-Watson test and this work. Since the Savitzky-Golay method consists basically of a set of local regressions, it should theoretically be possible to check for under- or over-fitting with a test similar to DurbinWatson’s test. EXPERIMENTAL SECTION Apparatus. Nuclear Magnetic Resonance (NMR). Solid-state T2-relaxation NMR experiments were performed on a Bruker (Karsruhe, Germany) 500-MHz NMR spectrometer. In T2relaxation experiments (i.e., spin-spin relaxation), the relaxation is due to long-range motions. These experiments are extensively used for polymer network characterization.18 In this particular case, several Hahn-echo pulse sequences (HEPS: 90°x-tHe180°x-tHe-acquisition) were performed, with a tHE value between 0.2 and 4 ms. For this work, only a single example spectrum was analyzed with the SG smoother, corresponding to a relaxation time of 0.6 ms. This spectrum was selected on purpose in order to test the SG filter: at this relaxation time the two components present in the sample (one more mobile and the other more rigid18) could be distinguished using the derivatives of the signal (but only if a proper SG smoother was used). The experiments were carried out at 213 K on the cross-linked samples (see the NMR section below for sample description). Liquid Chromatography (LC). A size-exclusion chromatography experiment (SEC) was performed on a polymeric sample (see LC section below for information about the sample). The chromatograph consisted of a Shimadzu (Hertogenbosch, The Netherlands) pump model LC-10ADvp operated at a flow rate of 0.4 mL/min, and equipped with an autosampler (SIL-10ADvp) and a ELSDLTC (Shimadzu). The ELSD was operated at 40 °C and 350 kPa. The separation column was a Mixed-C SEC column (50 × 7.5 mm i.d., Polymer Laboratories, Church Stretton, Shropshire, UK) operated at 40 °C. Mass Spectrometry (MS). A matrix-assisted laser desorption ionization time-of-flight mass spectrometry (MALDI-ToF-MS) experiment was carried out on a set of proteins (see NMR section below for information about the samples). A Kratos Axima CFR MALDI instrument (Kratos Analytical, Manchester, UK) was used, with an acceleration voltage of +25 kV. The reflectron mode was used for detection; 250 laser shots were summed to obtain the final spectrum. Reagents and Samples. NMR. The cross-linked polymeric samples were supplied by DSM Resins (Zwolle, The Netherlands) as unsaturated polyester resins in 33 wt % styrene. A catalyst (0.5% of a 1% cobalt solution) and a 2% solution of benzoyl peroxide (18) Litvinov, V. M.; Penning, J. P. Macromol. Chem. Phys. 2004, 205, 17211734.
initiator were added. The polyester chains were formed by a polycondensation reaction of maleic anhydride (MA) and o-phthalic anhydride (OPA) with 1-2 propylene glycol and ethylene glycol. The material was brought between two glass plates, degassed, and then cured for 24 h at room temperature. Thereafter, two post-curing steps were performed during 24 h each at 60 and 80 °C, respectively. For the NMR experiments, the resulting networks were ground to a fine and apparently homogeneous powder using a cryogenic grinder. LC. The sample consisted of a polystyrene standard with a molecular mass of 71 kDa obtained from Polymer Laboratories with a specified polydispersity of 1.03, dissolved in THF (Biosolve, Valkenswaard, The Netherlands). MS. A 0.1% TFA (spectrophotometric grade, Sigma-Aldrich, St. Louis, MO) solution was prepared first. The sample was a protein mixture dissolved in the TFA solution up to a concentration of 3 g/L. The matrix was a mixture of IAA (3,β-indolacrylic acid, Fluka, Buchs, Switzerland) dissolved (0.1 M) in the TFA solution and a cationizing salt (LiTFA, Sigma-Aldrich) dissolved (0.1 M) in the TFA solution. These solutions were mixed as sample/ matrix/salt ) 10:10:1. An aliquot of 1 µL of this mixture was spotted on the MALDI target for analysis after being air-dried. Software. Home-built routines for data treatment were developed in MATLAB 7 (The Mathworks, Natick, MA). RESULTS AND DISCUSSION Experimental Design with Simulated Data. As described above, the extent of smoothing by the SG algorithm should be balanced in such a way that the noise is maximally removed, while the signal features are kept intact as much as possible. To test how the proposed optimization approach works with respect to these goals, an experimental design was developed. This was carried out with simulated signals, so that way we knew the properties of the signal to retrieve. Gaussian peaks were generated, with different standard deviations (σp) and a maximum arbitrary peak height of 1. Colored noise, which followed the first-order autoregressive equation (eq 4) was added, with different autocorrelation values (F1) and standard deviations (σa). The signals were generated with different sampling rates (sr). Thus, four factors were taken into account, arranged in a full three-level factorial design (34 experiments). The factors varied in the following way (factor, values): σp, (50, 100, 150); F1, (-0.5, 0, 0.5); σa, (0.01, 0.05, 0.1); sr, (0.5, 1, 1.5). Each experiment was repeated 10 times, with different random noise generated in each case. For each repetition, the (noisy) signal was smoothed using the SG algorithm with a second-degree polynomial and different window sizes. Working with simulated data allows reliable measurement of both the absence of signal distortion and the removal of noise when the SG algorithm is applied. The achievement of this double goal was monitored using a Derringer desirability function19 as follows:
Di ) d1,id2,i
(10)
where the subscript i means that the signal was smoothed with an i-point window size, and d1,i and d2,i are the normalized first (19) Derringer, G.; Suich, R. J. Qual. Technol. 1980, 12, 214-219.
Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
4601
Figure 1. Autocorrelation diagrams (panels a-c) and first 100 points of the e vector (panels d-f) of simulated noise according to a first-order autoregressive model (eq 4), with different values of lag-one autocorrelation value and a constant σa of 0.5. Subdiagrams a and d correspond to F1 ) -0.5 (blue noise); subdiagrams b and e correspond to F1 ) 0 (white noise) and subdiagrams c and f correspond to F1 ) 0.5 (red noise). Dots in panels a-c represent the experimental values of autocorrelation (according to eq 3), whereas solid lines correspond to the fitted exponential function using nonlinear regression.
and second individual criteria defined as
d1,i ) 1 d2,i ) 1 -
| |
| |
hsmd,i - h0 h0
(11)
Ssmd,i - S0 S0
(12)
where hsmd,i is the maximum value of the smoothed signal (i.e., the maximum peak height) using an i-point window size, h0 is the theoretical maximum peak height (in this case 1), Ssmd,i is the standard deviation of the vector yres, and S0 is the theoretical standard deviation of the noise of the signal, calculated according to eq 7. The first criterion (d1,i)measures to what extent the first goal is accomplished. Since the smoother flattens the peaks, d1,i will be 1 when there is no peak deformation. The second individual criterion (d2,i) measures the noise removal from the signal compared with the theoretical value. It reaches one when the standard deviation of the removed noise (yres) matches the total noise of the signal. Note that the value of i that yields the maximum value of Di balances minimum deformation of the signal with the optimal noise removal. Unfortunately d1,i is only available with simulated data (the true maximum height should be known), so it cannot be used in practice. In this work, Di is used to measure 4602
Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
how closely the proposed approach comes to the ideal situation. Although Di is not an ideal criterion, it is very suitable for benchmarking purposes. To avoid any confusion, we will call the approach proposed in Selection of the Proper Window Size (the window size yielding F1,res closest to F1,e), approach i, and we will call the ideal approach (the window size yielding the highest Di value according to eq 10), approach ii. The comparison of both approaches is made as follows. For each simulated peak the Di values obtained with both approaches are considered. We define Dopt as the Di value obtained with approach i, and Dref is the Di value obtained with approach ii. The difference between Dref and Dopt, Ddiff ) Dref - Dopt, indicates how much the proposed approach differs from the reference criterion. Finally, the mean of these differences (for the 10 repeated experiments), D h diff, will be a representative value of this deviation at a specific point of the experimental design (i.e., at specific values of σp, F1, σa, sr). The significance of each factor was also evaluated by building a linear model of D h diff as a function of the four factors:
D h diff ) β0 + β1σp + β2F1 + β3σa + β4sr
(13)
where βi are the fitted model parameters. Prior to the computation, both the D h diff and the factor values were normalized. The significance of the parameters was tested by computing the R )
Figure 2. Autocorrelation diagrams (panels a-c) and the first 100 points of the noise vector (panels d-f) of experimental noise of the three detectors studied in this work. Panels a and d, b and d, and c and f correspond to NMR, LC, and MS noise, respectively. Dots in panels a-c represent the experimental values of autocorrelation according to eq 3. Subdiagrams a-c correspond to the enlarged region of the first 15 autocorrelation values, for whom the mean value is also depicted (solid line).
95% confidence interval of βi. Only those values that did not bracket zero were considered significant. Besides β0, two factors were found to have a significant influence on the deviation of the optimal value (Dopt) from the reference value (Dref): the peak width (β1) and noise level (β3). These results seem logical. Although β0 was significant, its value was only 0.012, which means that the deviation of the optimal value from the reference was only of 1.2%. Therefore, one can conclude that, in general, the proposed criterion gives results quite close to the benchmark. In general, the balance between the noise removal and the signal distortion is more difficult to reach for narrow peaks than for broader ones. This is because the SG algorithm has a similar effect as a low-pass filter.17 From this point of view, narrow peaks can be viewed as moderate-frequency signals, whereas broad peaks are low-frequency signals. Thus, the broader the peak, the easier it can be distinguished from the (high-frequency) noise, and the easier an optimal window size of the SG algorithm can be established. It is generally accepted that narrow peaks require smaller window sizes to avoid peak distortion.8,9 Although both approaches (i and ii) adapt the window size to the width of the peak, the effect of the peak width was found to be negatively significant (β1 < 0) for the deviation of the optimal value found with approach i (Dopt) from the reference (Dref). This is because the ideal criterion is more difficult to approach when σp decreases (narrower peaks).
Like the peak width, the noise amplitude of the signal also influences the window size. Again a balance between the noise removal and the signal distortion is more difficult to strike with more noisy signals, since in such cases the extent of smoothing should be increased to remove the noise, worsening the signal. Again, both approaches (i and ii) adapt the window size to the noise level of the signal, but the effect of the noise was found to be positively significant (β3 > 0) in the deviation between the two approaches. The higher the noise amplitude (σa), the more approach i deviates from approach ii. One should note that significant deviations between approach i and approach ii are only found in extreme cases (low signal-tonoise ratios or similar frequencies for noise and signal), in which the benefits of applying SG start to be doubtful. We want to stress that, even in these cases, a significant deviation between approaches i and ii does not mean that the proposed method fails. It merely indicates that the method is imperfect. Autocorrelograms of Simulated and Experimental Data. As explained previously, the first step in the optimization of the SG parameters consists of measuring the autocorrelation function of the noise of the instrument. There are several ways to investigate the noise correlation, depending on which autoregression function applies to the particular instrument. Figure 1, panels a-c, depicts noise correlations that obey the first-order autoregression model (eq 4). The noise was generated according to eq 4, with σa ) 0.5, and with F1 ) -0.5, 0, and 0.5 Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
4603
Figure 3. (Panel a) Plot of F1,res (solid line) vs window size when the SG algorithm is applied to smooth the signal corresponding to a NMR spectrum of a polymeric sample (depicted in panel b). Dashed line represents the Fe of the noise of the instrument. For details, see text. In the plots below, the smoothed signal (top diagrams), first-order derivatives (middle diagrams), and second-order derivatives (bottom diagrams) are depicted for specific window sizes of the SG algorithm. Numbers 1-3 correspond to window sizes of 5, 41, and 117, respectively. The corresponding values of F1,res obtained with these window sizes are plotted in panel a for clarity.
for Figure 1, panels a-c, respectively. Dots represent “experimental points” (values obtained by characterizing the simulated data). The exponential model (eq 5) was fitted to the experimental points. An arbitrary value of ∆t of 1 was used. To avoid the use of numbers with imaginary parts, the parameter fitted was actually F1 instead of τ, using the transformation given in eq 6. The algorithm used for the nonlinear fitting was Brent’s method.20 Figure 1, panels d-f, depicts the first 100 points of the e vectors (of 1000 total length) corresponding to the autocorrelograms depicted in panels a-c, respectively. Notice that the pattern (color) of the noise is different in each plot: Figure 1d corresponds to “blue” noise (F1 < 0), Figure 1e corresponds to “white” noise (F1 ) 0), and Figure 1f corresponds to “red” noise (F1 > 0). The pictures demonstrate that eq 3 constitutes a good alternative to the classical definition of autocorrelation given in ref 8. Indeed, a good agreement between the fitted values (F1 ) -0.48, -0.005, and 0.49) and the theoretical ones was found. In the Appendix, proof of the equivalence of both definitions is given. (20) Press: W. H.; Teukolsky, S. A.; Vetterling, W. T.; Flannery, B. P. Numerical Recipes in C, 2nd ed.; Cambridge University Press: Cambridge, 1992; Chapter 10.
4604 Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
If the first-order autocorrelation model holds, the nonlinear fitting of the autocorrelogram constitutes a good method to estimate F1 values. However, not all instruments yield a noise pattern according to the first-order autocorrelation model. This becomes clear from examining the noise of the instruments tested in this work. In Figure 2, panels a-c, the autocorrelograms of the noise of the NMR, LC, and MS detectors are depicted. Figure 2, panels d-f, depicts the mean-centered noise patterns of the three instruments. As can be seen, no exponential behavior of Fh versus h was observed, which confirms that the first-order autocorrelation function does not hold, at least not for the three instruments tested. This, however, does not mean that the noise of the instruments is uncorrelated. Indeed, in the three cases red noise was observed (F1 > 0), which can be confirmed by observing the noise patterns depicted in Figure 2d-f. Since an exponential function cannot be fitted to the autocorrelation function and since the first 15 points yield approximately constant Fh values (the first 15 points of the autocorrelogram are enlarged in inserts for clarity), the mean of the first 15 autocorrelation values will be used as an estimate of F1. The values of F1 found were 0.60, 0.99, and 0.82 for the NMR, LC, and MS experiments, respectively. It
Figure 4. Same analysis as depicted in Figure 3 for a signal corresponding to chromatogram of a polymeric sample detected using ELSD.
has been reported7 that the autocorrelation values are sensitive baseline drifts as well as baseline offsets. Therefore, the user should be careful in this point in order to calculate the autocorrelation in a part of the baseline free from these artifacts. Alternatively, baseline offsets and drifts may be corrected fitting the data to a straight line. As this is an operation that is performed only once, does not suppose a major problem from the perspective of automation. Optimal Smoothing for NMR, LC, and MS Measurements. It was demonstrated previously that, once the F1 value is known, approach i is sufficiently robust to tackle the optimization of window size with different signal and noise features (e.g., noise amplitude, peak width, or noise autocorrelation). Therefore, once the F1 value of the instrument has been calculated (as described in the previous section), the window size can be optimized in an unsupervised fashion, provided that the instrument conditions (i.e., the autocorrelogram and, consequently, the value of F1) do not change. One should note here that, in principle, F1 does not depend on the magnitude (standard deviation, σa) of the noise but only on the influence that a given measurement at a certain time exerts on the next one. At this stage, despite having studied three different kinds of instruments, we have gained limited experience with autocorrelograms. If the noise auto-correlation does not remain constant during the time of one (or more)
experiments, the F1 value can be recalculated based on (short) blank experiments between the actual measurements. If the value of F1 changes within one experiment, more elaborate strategies may need to be developed. These may include computation of F1 from (several) empty parts of a spectrum or chromatogram. Such more complex applications of the present approach have not yet been investigated. However, they do appear perfectly feasible. Apart from the instrumental settings that can affect the autocorrelogram, special caution should be applied when the data acquisition frequency is changed. Most instruments integrate the signal when a lower data acquisition frequency is selected (in order to increase the signal-to-noise ratio). This in fact works as a low-pass filter, changing the autocorrelogram of the signal, and thus necessitating a recalculation of the F1 value. Figure 3a depicts the lag-one autocorrelation of the residuals (F1,res) when different window sizes are used to smooth the NMR signal (solid line). The horizontal dashed line indicates the lagone autocorrelation for a blank signal (F1,e), calculated according to the previous section. The raw signal is also plotted (Figure 3b). The smoothed signals with different window sizes, together with the first- and second-order derivatives are plotted in Figure 3c. Numbers 1-3 correspond to window sizes of 5, 41, and 117, respectively. Due to the characteristics of the experiment, two Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
4605
Figure 5. Same analysis as depicted in Figure 3 for a signal corresponding to a MS spectrum obtained using MALDI-ToF-MS.
overlapping peaks are expected, which can be better observed using the first- and second-order derivatives. It is clear that a fivepoint window yields a too noisy signal (almost uninformative). This is especially apparent for the second-order derivative. In this case, F1,res is far below the optimal value. A 117-point window seemingly yields a satisfactory result (especially for the secondorder derivative). However, in this case the signal was too much distorted; the peak was flattened by around 10% of its original peak height. The distortion is also much more obvious from the derivative signals. As can be observed, the window size that provides a compromise between noise removal and maintaining signal integrity is the one recommended by approach i, yielding a window size of 41 to match F1,res with F1,e. Figure 4 depicts a similar study as that of Figure 3 but now applied to the chromatographic signal (see the Experimental Section). Figure 4a shows a plot of F1,res versus window size with a magnification of the region of 0.9 < F1,res < 1. Compared to the results found with NMR, the deformation of the signal is in this case less important. With respect to its original height, the smoothed peak is flattened by around 0.05%, 1%, and 2.2% with a window size 5, 93, and 199 points, respectively. However, more significant differences can be observed in the derivatives. Clearly, a five-point window yields too noisy signals, especially if one inspects the second derivative. However, it is difficult to decide which of the other two results (using a window size of 93 or of 4606
Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
199 points) yields the best smoothed signal and derivatives. According to approach i, a window size of 93 points is optimal. However, the F1,res function is seen to approach unity asymptotically. This gives rise to less significant differences between F1,res values for window sizes 93 and 199. From this point of view, one can conclude that the result found with larger window sizes is reasonably satisfactory, since in this region F1,res is close to F1,e, and it is hardly dependent on the window size. The same conclusion can be drawn from the inspection of the results shown in the figure. In Figure 5, the same study is presented for an MS spectrum of two proteins. This case is similar to what was discussed in Figure 3, although the signal corresponds to a completely different instrument. The present approach recommended a window size of 29 points. The same conclusion can be drawn from the figure, in which a five-point window size yields too noisy (secondderivative) signals and a 99-point window size yields too much signal distortion. More detailed figures showing the results are available as Supporting Information. One warning has to be given at this point. We have demonstrated that, from a theoretical perspective, when Fe equals Fres, the window size established with the currently proposed method is optimal. However, it can happen in practice that the informative part of the signal is located too close to the extremes of the data vector, so that the “optimal” window size results in the removal
of informative points. However, this is not a problem of the proposed automation. Rather, it is a general problem of the SG algorithm itself. Even when selecting an appropriate value for w manually, the user would not obtain an optimal value of w that leaves enough points at the extremes. Either information will be lost at the extremes or a too flexible filter will result, which retains too much noise. There is little that automation can offer in this case. The user has to make sure that enough baseline is collected at the extremes of the data sets, to apply SG automatically without problems. Diagnosing the removal of informative data points will be the subject of future investigations. CONCLUSIONS When a signal is smoothed, it is impossible to find a method able to eliminate all the noise without loosing any valuable information. Therefore, a balance must be found between noise removal and signal distortion. A simple method is proposed in this work to find the best compromise between these two goals. The approach is based on a comparison of the residuals of the fitting (i.e., the differences between the input and the smoothed signals) with the noise of the instrument. The smoothing function that yields the autocorrelation of the residuals closest to the autocorrelation of a corresponding blank signal is considered to be optimal. The method was applied to optimize the window size in the SG algorithm, but it can also be applied to optimize other smoothing algorithms. Besides its simplicity, the SG smoother has the advantage of providing the derivatives of the signal without any additional computational cost. To test the robustness of the proposed method, an experimental design was performed in which the peak width, the noise amplitude, the noise autocorrelation, and the sampling frequency were varied systematically. The method proved to be robust. It was shown to find the optimal window size without any user intervention in situations in which changes in these four factors were induced. To quantify the deviations of the proposed approach from the ideal smoothness, a Derringer function that assesses the two goals (signal distortion and noise removal) was considered as a reference. The proposed approach closely matched the ideal criterion (a mean deviation of only 1.2% was found). Apart from this offset, only two factors were found to significantly affect this deviation (i.e., the peak width and the noise amplitude). Both factors affect the balance between noise removal and signal distortion. The lower the peak width and the higher the noise amplitude, the more difficult it is to distinguish between noise and signal. The only a priori information that is needed to apply the proposed method is the autocorrelation of the noise of the instrument. Therefore, in practice the algorithm is applied in two steps. In the first step, the autocorrelation of the noise of the instrument is computed. This can be performed by running a blank sample and by fitting an autocorrelogram to the signal. Since the three different instruments considered did not yield first-order autocorrelated noise, an exponential function could not be fitted to the autocorrelograms. Instead, a near-constant value was found for the correlations with small lag values (F1 through F15) and the mean value of these was used as an estimate of F1. The lag-one
autocorrelation coefficient is used in a second step, in which the raw signal is processed with different window sizes, and the residuals of the smoothing are computed. The window size that yields the autocorrelation of the residuals closest to the autocorrelation of the noise of the instrument is considered optimal. Three different signals were used to test the proposed approach: solid-state NMR, a chromatographic peak detected with an evaporative light-scattering detector, and a spectrum obtained by MALDI-ToF-MS. In all cases, the window size recommended using the proposed approach proved to be a good compromise, which led to little signal distortion and efficient noise removal. As the method is unsupervised, it is suitable for application within more complex computer programs. ACKNOWLEDGMENT This work was supported by projects HPRN-CT-20001-00180 (COM-CHROM, European Union), EX2005-0403 (Spanish Ministry of Science and Education), and MEIFCT-2005-024198 (CHEVAR, European Union). The authors thank Maya Ziari, Wim Decrop, Petra Aarnoutse, and Erwin Kaal for kindly making their data available. SUPPORTING INFORMATION AVAILABLE Three additional figures. This material is available free of charge via the Internet at http://pubs.acs.org. APPENDIX In this Appendix, a demonstration of the validity of eq 3 as a definition of autocorrelation is given. We start with the definition of autocorrelation of Massart et al.7
Fh,m )
1
n-h
(ei - je)(ei+h - je)
i)1
sx2
∑ n-h-1
where Fh,m is the auto-correlation function defined in the mentioned reference, je is the mean of e, and sx2 is the variance of e. However, in our case we have supposed that the vector e is mean-centered. Then n-h
Fh,m )
n-1
n-h
∑
n - h - 1 i)1
(ei)(ei+h) sx2
)
∑ (e )(e i
n-1
i+h)
i)1
n-h-1
≈
n
∑ (e )
2
i
i)1
n-h
n n-h
∑ (e )(e i
i+h)
i)1
n
∑ (e )
2
i
i)1
The last part of the equation is valid for n . 1, or for those values that are intrinsically mean-centered (as noise). Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
4607
h ei) and the end However, in random series, the beginning (∑i)1 n of the series (∑i)n-hei) may be considered equivalent, therefore:
From eq 3: n
Fh,w ) 1 -
1
∑
n
(ei - ei-h)2
i)1+h
2n-h
[ ] [ ] n
) n
Fh,w ) 1 -
∑ (e )
2
i
n n-h
i)1
1-
∑ (e - e i
n
1-
(ei)2
i)n-h
+ Fh,m )
n
∑ (e )
2
i
n-h
1
∑
i+h)
i)1
2
n
i)1
2n-h
n
1-
∑ (e )
2
i
i)1
n
n-h
∑ (e )
n
2
i
-
i)1
∑
(ei)2
i)n-h
n
∑ (e )
+ Fh,m )
2
i
i)1
where Fh,w is the auto-correlation function defined in this work. This yields n-h
n-h
∑ (e )
2
Fh,w ) 1 -
1
n
i
i+h)
i)1
2
-2
i)1
2n-h
∑ (e )(e i
1-
i+h)
∑ (e )
n
n
∑
h
(ei)2 -
i)1
2
∑
2n-h
4608
+ Fh,m
∑ (e )
2
n-h n However, ∑i)1 (ei)2/n - h ≈ ∑i)1 (ei)2/n since both can be definitions of the standard deviation of e. Therefore
n
(ei)2 -
i)1
1-
n
i
Fh,w ) 1
n-h
i
i)1
n
i
1-
n
i)1
i)1
i)1
2
[] ∑ (e )
2
n-h
∑ (e
+
n-h
n
∑
∑
i)n-h
[
n-h
Fh,w ) Fh,m
n-h
(ei)2 - 2
(ei)(ei+h)
i)1
)
n
∑ (e )
2
i
i)1
1-
h
n
∑ (e ) 1
2
i
+
i)1
2
∑
n
∑ (e )
]
(ei)2
i)n-h
2
i
i)1
Analytical Chemistry, Vol. 78, No. 13, July 1, 2006
+ Fh,m
showing that the two definitions are equivalent.
Received for review January 4, 2006. Accepted April 10, 2006. AC0600196