Polynomial filters for data sets with outlying or missing observations

charge-coupled device (CCD) detectors make them ideal for the weak signal levels produced in these experiments (6-9). The characteristics of CCD detec...
0 downloads 0 Views 834KB Size
Anal. Chem. 1880, 62, 2351-2357

(17) (18) (19) (20) (21) (22) (23) (24) (25) (26) (27) (28) (29) (30) (31) (32) (33) (34) (35) (36) (37) (38) (39)

Blaha, C. D.; Lane, R. F. Neurosci. Lett. 1987, 78, 199-204. Lane, R. F.; Blaha, C. D. Braln Res. Bull. 1987, 78, 135-138. Lane, R. F.; Blaha, C. D. Brain Res. 1987, 408, 317-320. Lane. R. F.; Blaha, C. D.; Hari, S. P. Brain Res. Bull. 1987, 79, 19-27. Broderick, P. A. Life Sci. 1985,36, 2269-2275. Broderlck, P. A. Neurosci. Lett. 1988, 95, 275-280. Yamamoto. B. K.: Lane. R. F.: Freed. C. R. Life Sci. 1882. 30. 2 155-2 162. Broderick, P. A. NeuropeptMes 1987, 70, 369-386. Signs, S.A.; Yamomoto, B. K.; Schechter, M. D. Neuropharmacology 1887, 26, 1653-1656. Yamamoto, B. K.: Spanos, L. J. Eur. J. Pharmacol. 1988, 748, 195-203. Glynn, G. E.; Yamamoto, B. K. Brain Res. 1989, 487, 235-241. Phillips. A. G.; Pfaus, J. G.; Blahm, C. D.; Fibiger, H. C. J. Neurosci. Methods 1989, 2 9 , 275. Gelbert, M. B.; Curran, D. J. Anal. Chem. 1988, 58, 1028-1032. Crespi, F.; Martin, K. F.; Marsden, C. A. Neuroscience 1988, 27, 885-896. Ormonde, D. E.; O'Neill, R . D. J. Electroanal. Chem. InterfacialNectrochem. 1989,26 7 , 463-469. Lyne, P. D.; O'Neili, R. D. Anal. Chem. 1989, 67, 2323-2324. O'Neill. R. D.; Fillenz, M.; Albery, W. J.; Goddard, N. J. Neuroscience 1983. 9 , 87-93. Saralathan, M.; Osteryoung R.; Osteryoung, J. J. Nectroanal. Chem. Interfacial Electrochem. 1986,214, 141-156. Penczek, M.; Stojek, 2 . J. €lectroanal. Chem. Interfacial Electrochem. 1987,227, 271-274. Nicholson, R. S . Anal. Chem. 1985,37, 1351-1355. Rice, M. E.; Galus, 2.; Adams, R. N. J. Nectroanal. Chem. Interfacial Nectrochem. 1883. 743, 89-102. Oldham, K. B. J. .€/echoanal. Chem. Interfacial Electrochem. 1985, 784, 257-267. Bard, A. J.: Fauikner, L. R. Nectrochemicai Methods; John Wiley & Sons: New York, 1980; p 13.

235 1

(40) Reinis, S.;Goldman, J. M. The Chemistry of Behavior, Plenum Press: New York, 1982 p 7. (41) Deakin, M. R.; Kovach, P. M.; Stutts, K. J.; Wightman, R. M. Anal. Chem. 1988, 58, 1474-1480. (42) Deakin, M. R.; Stutts, K. J.; Wightman, R. M. J. Nectroanal. Chem. Interfacial Nectrochem. 1985, 182 113-122. (43) Albahadlly, F. N.; Mottola, H. A. Anal. Chem. 1987, 59, 958-962. (44) O'Neiil, R. D.; Fillenz, M. Neurosci. Lett. 1985, 60,331-336. (45) Boutelle, M. G.: Svensson, L.; Fillenz, M. Neuroscience 1989, 30, 11-17. (46) Ciemens, J. A.; Phebus, L. A. Life Sci. 1984,35, 671-677. (47) Brose, N.; ONeill, R. D.; Boutelle, M. G.; Fillenz, M. NeuroDharmacologYl989, 28, 509-514. (48) Louilot, A,; Gonon, F.; Buda, M.; Simon, H.; Le Moal, M.; Pujol, J. F. Brain Res. 253-263. . lg85. ...., 336. .- ., . - -. . (49) Wilson,. R. L.; Kamata, K.; Bigelow, J. C.; Rebec, G. V.; Wightman, R. M. Brain Res. 1986, 370, 393-396. (50) Gonzalez-Mora, J. L.; Sanchez-Bruno, J. A.; Mas, M. Neurosci. Lett. 1988, 86,61-66 (51) Mueller, K.; Haskett, C. Pharmacol. Biochem. Behav. 1987, 27, 231-234. (52) O'Neill. R. D.; Grunewald, R. A.; Filienz, M.; Albery, W. J. Neuroscience 1982, 7 , 1945-1954. (53) Gonon, F.;Fombarlet, C. M.; Buda, M.; Pujol, J. F. Anal. Chem. 1981, 53, 1386-1389. (54) Aibery, W. J.; Goddard, N. J.; Beck, T. W.; Fillenz, M.; O'Neili, R. D. J. Nectroanal. Chem. Interfacial Nectrochem. 1984, 767, 221-233. (55) Cheng, H. Y. J. Elechoanal. Chem. Interfacial Electrochem. 1982, 735, 145-151.

RECEIVED for review January 2,1990. Accepted July 11,1990. We thank EOLAS for a grant to P.D.L. under the Basic Research Awards scheme.

Polynomial Filters for Data Sets with Outlying or Missing Observations: Application to Charge-Coupled-Device-Detected Raman Spectra Contaminated by Cosmic Rays G. R. Phillips and J. M. Harris* Department of Chemistry, University of Utah, Salt Lake City, Utah 84112

Occasionally data sets contain points with excess error so that no information about the underlying signal is available from these observations. Such points should be considered lost and be given no influence on the results. Polynomial filters suitable for smoothing data sets containing outlying or missing observations are presented. Outliers are first identified by their deviation from the trends of the surrounding data, relative to a robust estimate of the standard deviation. I n the vicinity of an outlying point thus identified, the missing-point filter employs a polynomial flt to the points surrounding, but not including, the outlying or missing point. I n the absence of outliers, the procedure reverts to ordinary Savitzky-Goiay polynomial filtering. The equations for generating these filters and predicting their statistical properties are derived. Their performance is evaluated on chargetoupied-device-detected Raman spectra contaminated with cosmic ray events and is compared wlth nonlinear smoothing algorithms. The linear, missing-point polynomial filters are found to be efficient, both in speed and in ability to bridge gaps in data with minimal distortion.

Least-squares polynomial digital filters, introduced by Savitzky and Golay (I), are frequently used to remove high-

* To whom correspondence should b e addressed.

frequency noise and enhance the signal-to-noise ratio of spectroscopic or time-dependent data (2). Errors in the original smoothing coefficients published by Savitzky and Golay were corrected by Steiner (3),while Madden (4) published equations for calculating the convoluting coefficients. Polynomial filters are moving weighted averages computed by convoluting a set of regression-generated filter coefficients with a moving data window (5). These filters can be applied to any data set of equally spaced observations with locally constant variance (i.e. within the filter width). For data with normally distributed errors, least-squares polynomial filters are maximum likelihood estimators. The efficiency of least-squares estimators is severely degraded by the presence of deviant observations, as shown by the distortion of spectra containing outliers when smoothed by Savitzky-Golay filters. Outliers are a common problem in analytical techniques such as Raman spectroscopy that utilize ultrasensitive detection methods. The high sensitivity and low noise of charge-coupled device (CCD) detectors make them ideal for the weak signal levels produced in these experiments (6-9). The characteristics of CCD detectors have been recently reviewed (10-11). A significant problem with CCD detectors is their sensitivity to cosmic rays (6-12). Cosmic ray events produce enormous charge spikes ( 103-104 photoelectrons), which cause large distortions in the spectrum when ordinary linear smoothing procedures are used. Outliers in Raman spectroscopy may also arise from particles or bubbles in the

0003-2700/90/0362-2351$02.50/00 1990 American Chemical Society

2352

ANALYTICAL CHEMISTRY, VOL. 62, NO. 21, NOVEMBER 1, 1990

focused laser beam. Whatever the cause, outliers should be removed before background subtraction, peak fitting, or other data processing. In this work, a linear smoothing algorithm suitable for outlier-contaminated data is presented. The method is also effective in bridging gaps in data sets where observations are missing or lost. In the vicinity of an outlying or missing point, filters similar in design to those of Savitzky and Golay ( I ) employ a polynomial fit to the points surrounding, but not including, the outlying or missing point. In the absence of outliers, the procedure reduces to a Savitzky-Golay polynomial filter. After equations for generating these filters and predicting their precision are derived, their performance is tested on CCD-detected Raman spectra contaminated with cosmic ray events; the results are compared with those produced by two nonlinear smoothing algorithms proposed by Hillig and Morris (13)and Bussian and Hardle (14). The linear, missing-point polynomial filters are found to be efficient, both in speed and ease of implementation and in their ability to bridge gaps in data with minimal distortion.

EXPERIMENTAL SECTION The linear polynomial filters (and the comparison nonlinear smoothing algorithms) were written in Microsoft FORTRAN and executed on a PC-AT compatible computer with an 80827 math coprocessor using subroutines from the GRAFMATIC library (Microcompatibles, Inc., Silver Spring, MD). Analytical equations for filter coefficients were derived on a VAX 8300 computer by using the algebraic programming language REDUCE (Rand Corp., Santa Monica, CA.). Raman spectra of 0.4 mM ferriprotoporphyrin chloride (hemin) in aqueous solution buffered at pH 7 were produced by using 514.5-nm incident laser radiation and a CCD (Thomson, Model CSF THF7882CDA) detector housed in a cryogenic dewar (Photometrics, Model CH210) cooled to -115 "C. The two-dimensional (384 columns by 576 rows) CCD was positioned such that the spectrum was dispersed along the horizontal (576 channel) detector axis and photoelectrons were binned (summed) along the vertical axis to minimize readout noise. The complete Raman instrument has been described elsewhere (15). RESULTS AND DISCUSSION Missing-Point Polynomial Filters. Occasionally data sets contain a few erroneous points, contaminating an otherwise usable spectrum or other data record. If the error in these points is so large that no information about the underlying signal is available from the observation, then the points should be considered missing or lost and should have no influence on the results. In these cases, a desirable smoothing procedure should automatically detect and replace these outliers with a statistically valid estimate of the point derived from adjacent data. The missing-point filtering method described in this section accomplishes this goal. Presently, the only restrictions with the method are that the data be evenly spaced, that any spike or outlier occupy no more than two adjacent data channels, and that such outlying events be separated by at least one filter width. The latter two restrictions can easily be relaxed with further development, as discussed below. Outliers in the data are identified by first smoothing with a quadratic Savitzky-Golay filter. Standardized residuals are formed, where y r and 9, denote the original and filtered points, respectively. The standard deviation, s, is estimated from the inner quartile range (16) by s =

(1175

- 1125)/1.3490

(2)

where c(25 and p'75are the 25th and 75th percentiles (or first and third quartiles) of the distribution of residuals. This estimate of the standard deviation requires sorting the residuals but is insensitive to a small number of outliers, in

contrast to the more familiar root mean square estimate of the standard deviation (16). Observations whose standardized residuals are greater than some constant, r i / s > c, are considered to be outliers (note: we are only testing for positive outliers in the present study since Cosmic ray events and particles in the sample generate excess charge in the detected spectrum). For this work, a value, c = 3.5, was chosen to reject points outside the 99.98% one-sided confidence interval of a normal error distribution. Normalization of the residuals by an estimate of the standard deviation means that c does not require adjustment as the noise in the spectrum changes, eliminating the need for the subjective selection of error parameters. For an isolated outlier or spike, ys, identified by the above procedure with a filter width equal to 2m + 1,all the filtered points within the interval ySwm to ys+mhave been contaminated by the outlier during the Savitzky-Golay smooth and must be recalculated. Each of the contaminated intensities f s + k , k = -m, ..., m, is replaced by a quadratic polynomial leastsquares fit to the m points on each side of y s + k , excluding ys. These least-squares estimates are computed by using the filter or convoluting coefficients C 2 , k , L m

gs+k

= j=-m

(3)

C2,k,iYs+k-i

for k = -m, ...,m. The convoluting coefficient for the outlier, C z , k , i = k , should be zero in all 2m + 1 sets of coefficients. The convoluting coefficients are computed by using a matrix formulation (3,17-19) of least-squares polynomial filters. This approach t,o the polynomial regression problem (20,21)simplifies the design of new filters for initial point estimation, for example (17), and allows the statistical properties of polynomial filters to be predicted from first principles (18, 19). For quadratic filtering, the best fit quadratic function is defined by the n + 1 = 3 parameters, Pnj, which best fit 2m + 1 simultaneous equations, one equation for each point within the 2m + 1 data segment, of the form

where ti is the difference between the observation signal and the polynomial fit. Since there are more observations than parameters, this is an overdetermined problem that can be solved by least squares. The set of 2m 1 equations represented by eq 4 can be expressed in matrix notation, where letters capped by an arrow denote column vectors and boldface letters denote matrices.

+

Y'=XP+Z

(5)

In this equation, y' contains the-2m + 1 observations of the dependent variable, the vector /3 contains the n + 1 = 3 unknown parameters, and 7 is a vector of 2m 1 errors (in the absence of any deviation between the underlying signal and the polynomial model, this vector will contain the residuals, rj). The design matrix X has dimensions 2m + 1by n + 1 and coqtains the ilfactors from eq 4. The least-squares estimate of p, denoted b, is obtained by minimizing the sum of squares of error and is given by (20, 21)

+

From eq 4, one sees that the value of the polynomial at i = 0 (midpoint of the 2m + 1 data segment) is just the zerothorder parameter, &,. Combining eq 2 and 6 with this observation indicates that the convoluting coefficients, Czpj,for quadratic midpoint smoothing are found in the first row of the matrix T = (XtX)-lXt. The problem of having an outlier in the data vecto;, y', is solved by eliminating its influence on the estimate of b in eq

ANALYTICAL CHEMISTRY, VOL. 62, NO. 21, NOVEMBER 1, 1990

2353

Table I. Quadratic Missing-Point Filter CZ,k,i,l i

-3

-2

-1

0

t1

+2

-588

1164

3276

3780

2172

14112

14112

14112

--3948

14112

14112

14112

-1764

5544

7308

7812

7056

5040

29988

29988

29988

29988

29988

29988

29988

29988

6084

1950 _-

--

-4

+3

t4

k -4

0

---

-3

-1872

--

-2

33228 -2844

---

-1

31932

3354

----

3084

7272

31932

31932

2520

7020

--

-3780

0

91 26

9672

33228

33228

----

33228

---

10428

--

31932

9720

n

8658

--_

33228 9396

---

31932 9720

----

30960 -4140

+1

31932

-3744

----

- ---

-2172

29988

-1164

t4

9396

31932

33228

+3

6624

---

---

+2

2112

----- --

I -

14112

1950

33228 1764

--

29988

31932

31932

6084

8658

33228

33228

5040

7056

29988

29988

2172

3780 --

- ---

924

14112

_ I _

14112

14112

6. This is easily accomplished by replacing the corresponding row in the design matrix, X, by zeros. For example, for a quadratic fit of a 9-point segment where the point at k = -2 (or two increments before the midpoint) has been identified as an outlier, the design matrix, Xn,k would be given by

i o

1

2

k -4

4

1

0

0

2 3 4

16

:I

10428

U

__I

In order to estimate all of the points within the interval k = -m, ..., m or -4, ..., 4 of the outlier, 2m + 1 or nine of these design matrices must be constructed and the convoluting coefficients determined by solving for the first row of the matrix T = (XLX)-'Xt. Convoluting coefficients, thus derived, for quadratic filtering of points within an interval of f4 from a single outlier are listed in Table I. A similar approach is used when a spike event contaminates two adjacent points in the data. Here, two rows of the design matrix are replaced by zeros (except a t the ends of the filter, k = -m and m + 1, where only one element of the spike event overlaps the filter, and hence only one row of zeros is needed). The convoluting coefficients appropriate for two adjacent outlying points are listed in Table I1 for 9-point quadratic filtering. To build other missing-point quadratic filters of arbitrary width, analytical expressions for the coefficients of these filters were derived from eq 6 for a design matrix having an arbitrary number of rows, 2m + 1.

31932 9672

---

33228 7812

29988 3948

--

14112

--

33228 6624

31932

924 _--

14112 1764

-_--

33228 2112

---

31932

-4140

--

31 932

30960

30960

3084

-2844

31932

31932

3354

-1872

33228

33228

7272

---

---

-----

33228

-3780

--

--

5544

- 1 764

29988

29988

7308 ------

14112

33228

30960

9126

3276

-3744

2520

--

--

-2772

-----

7020

-- --

3 1 932

29988

- 1 764 -14112

1764

--

14112

-588 -14112

0

These expressions are listed in Table 111;included in the table are filter coefficients for a spike consisting of a single channel or two adjacent channels. The design of missing-point filters is not limited to quadratic fitting functions or to two adjacent outliers. Higher-order polynomial filtering can be implemented by expanding the number of columns in the design matrix; similarly, other patterns of missing or outlying points can be accommodated by choosing different rows of the design matrix to replace with zeros. Smoothing Performance. The matrix derivation for least-squares polynomial filters can also be used to predict the statistical consequences of applying these filters to noisy data (18, 19). The elements of the matrix V = uo2(XtX)~' predict the variance and covariance of the parameters, b, estimated by eq 5 (20,21),where u: is the variance of the noise in the raw data. The first diagonal element of V contains the expected variance of b,,o, the estimated or smoothed value at the origin or midpoint, i = 0. This method was applied to the design matrices for missing-point quadratic filters like the example shown in eq 7 . Dividing the predicted variance of the smoothed midpoint by the variance of the raw data, u z , and taking the square root gives the relative. change in standard deviation due to the filtering process, as plotted in Figure 1 for 9-point quadratic filters for one or two missing points. The results show that the greatest uncertainty is in estimating the missing point itself, k = 0. For a single missing point, the standard deviation of this estimate is