Problems of smoothing differentiation of data by least-squares

Feb 15, 1987 - General least-squares smoothing and differentiation by the convolution (Savitzky-Golay) method. Peter A. Gorry. Analytical Chemistry 19...
0 downloads 0 Views 465KB Size
Anal. Chem. 1987, 5 9 , 654-657

854

838.438

/3

information is utilized instead of one in the linear regression of the peak positions. To make the time warping work, the intensity values in the two chromatograms should agree. If the intensity values vary too greatly, it is probably better to use the retention time information only, as a match using retention time markers would give a better result. The restrictions in time axis should be carefully chosen to meet the experimental conditions. Too restricted conditions will not leave enough room for the time axis to expand or contract. Too broad conditions will not allow convergence at all. The applications of the time-warping algorithm certainly are not limited to the chromatogram matching. As long as an elastic operation is preferred in one axis, time warping can be applied.

LITERATURE CITED t.1

Figure 8. Final tlme-warping function plotted in solM line and the linear function F, plotted in dashed line with the whxlow constraint swounded.

intensity values between the last two-thirds of the two chromatograms. This agreement, which is the goal of the segmented adjustment, greatly enhances the matching. However, as has been mentioned earlier, there is no reason to believe that the intensity values from two different instruments should agree. The results show that the intensities should be made similar in order to achieve a good match by the time-warping algorithm.

CONCLUSIONS ~i~~ warping using c ~ o m a ~ g r a intensity p ~ c infomation to correct the time-axis mismatching has been shown to provide a better match of two related chromatograms than the match using a linear regression method. Two-dimensional

(1) Williams, S. S.;Lam, R. B.; Isenhour, T. L. Anal. Chem. lg83, 55, 1117-1121. (2) Wllllams, S. S.; Lam,R. B.; Sparks, D. T.; Isenhour, T. L.; Hass. J. R. Anal. Chlm. Acta 1982, 138, 1. (3) Williams, S. S. Ph.D. Dissertation, 1984. (4) Greene, W. E.; Williams, S. S.;Isenhour, T. L. Pittsburgh Conference Paper No. 193, 1984. (5) Scmkoff, D., Kruskal, J. B.,Eds. Time Warps, String Edits, and Macromolecules : The Theory and Practice of Sequence Comparison; A d dison-Wesley: Reading, 1983. (6) Sakoe, H.; Chlba, S. I€€€ Trans. Acoust ., Speech, Signal Process. 1978, ASSP-26, 43-49. (7) Rabiner, L. R.; Rosenberg, A. E.; Levinson, S. E. I€€€ Trans. Acoust., Speech, Signal Process. 1978, ASSP-26, 575-582. (8) M i l , K. V. Optimization Methods; Wlley Eastern Limited: New Delhl, 1976.

(9) i i i m e r , D. A.; Ctmttergy, R. Introduction to Nonlinear Optimization ; North-Holland: Amsterdam, 1978. (10) de Haseth, J. A.; Isenhour, T. L. Anal. Chem. 1977, 4 9 , 1977.

RECEIVED for review February 4,1985. Resubmitted August 8, 1986. Accepted October 27, 1986.

Problems of Smoothing and Differentiation of Data by Least-Squares Procedures and Possible Solutions Arshad Khan Chemistry Department, T h e Pennsylvania State University, DuBois, Pennsylvania 15801

Smoothing and dlfferenUatkn of experh\ental data somethes necessitate leastgquares pdlrnomlal ftttlng of a large number of data polnts at a tlme (17, 19, or 21 polnts) rather than 3, 5, or 7 points wlth repeiltlon of the procedure for several Iterations. For a 5- or 7-pdnt fit, one loses smoothlng of only 2 or 3 points, respectively, at each end, and for a 19- or 2lpolnt flt one - 9 or 10 points, respectively, at each end. No other smoothing procedure Is known today to handle the edge polnt smoothing and dtfferentlatlon. I n thls paper a procedure ls suggested for smoothing and dlfferentlatlng essentially every data point. I n addltlon, dtfferent orders of polynomlal fits have been tried for the same data set. I t Is notlced that the parabolic fIt glves the best smoothing of the nolsy data set.

from the experimental data, it is necessary to remove as much of this noise as possible without degrading any underlying information. Several numerical techniques are known today and have been applied for data analysis (1-4). In this paper a smoothing and differentiation of experimental data by a least-squares procedure is described. Special reference is made to the analysis of the molecular beam experimental data which is used to obtain energy distributions (5, 7) of Penning ions (e.g., Hz+)coming from the He(21s)/H2reaction. Here He(2ls) represents a metastable state of He generated by electronic excitation of the ground state He (6, 7). The energy of the metastable He (20.6 eV) is large enough to ionize almost any atom or molecule. The following reaction takes place in the “reaction chamber”, with H2+ions scattered at different angles with respect to H2 beam direction: He(2ls)

In some experiments it is difficult to achieve a satisfactory signal-to-noise ratio because of the superimposition of random noise in experimental data. In order to retrieve information

-

+ Hz

He

+ Hz+ + e-

The detector was set at a certain angle with respect to the ground-state beam (here H2 beam). Data were collected for a certain interval of time ranging from 5 min to 1h depending

0003-2700/87/0359-0654$01.50/00 1987 American Chemical Society

ANALYTICAL CHEMISTRY, VOL. 59, NO. 4, FEBRUARY 15, 1987

upon the counting rate at that angle. In the energy analysis experiments, the scattered ions (H2+)experience a retarding ramp voltage (E') before being detected and recorded by a 100-channel multiscalar (Honeywell SAI-42 A). The detection system, which consists of a mass spectrometer and a spiraltron electron multiplier, has been elaborately discussed elsewhere (5-7). Each channel (bin) number of the multiscalar is related to the energy of the scattered ions. For Hzf scattering experiments, the energy expression is given by

E = (bin number

655

fi

- 17.5) X 0.708 kcal/mol

Counts (signal) at bin numbers from 18 to 98 provided the whole spectrum. The signal recorded at each channel is related to the intensity I ( E ) of the scattered ions by the following expression:

Differentiation of the above equation (Le., data recorded at different channels) provided the energy distribution I ( E ) of the Penning ions. Differentiation without smoothing or after smoothing by a five- or seven-point least-squares fit generated spurious and unacceptable peaks on the distribution plot (i.e., I ( E ) vs. E plot). To get the best distribution, a larger point fit (19 or 21) was necessary. To recover the loss of information from the edge points, which provided the low and the high energy part of distribution, a smoothing procedure was developed and is described in this paper.

THEORY AND DISCUSSIONS Let the fitting function be defined by the following polynomial equation of order m:

y ( x ) = uo + alx

+ u2x2 + ... + u,xm

(1)

The coefficients of the polynomial, ao,al, ...,a, are determined by minimizing the weighted residual function

R = CW,[Y, -Y ( ~ , ) I ~

(2)

Here x , and y, represent the coordinates (here bin number and signal counts, respectively) of the ith data point, and w, = 1/u: is the weighting factor that weights each term of the sum in R according to how large or small a deviation is expected. The summation ranges from N to NPT, where N and NPT are the indexes of the data points between which the fitting is done. NPT = N + NFT - 1, where NFT is the number of data points used on polynomial fitting. By minimization of the residual function with respect to each coefficient of the polynomial, i.e., by setting SR/Ga, = 0 (i = 1, 2, ... m), one obtains a system of m simultaneous equations. This can be written in the form C A = B , where C, A , and B are the following matrices:

.. .

z;W,x,q

B=

As before, the summations are taken over the range from N to NPT. In order to determine A , and hence the coefficients

.*.. 5.

6 1 Flgure 1. Effect of increased NFT on the quality of smoothing. The curves are separated from each other by arbitrary amounts for clarity of viewing. The curves 1 and 2 are obtained by using NFT = 5, ITERATION = 5 and NFT = 7, ITERATION = 3, respectively. I t is quite noticeable that an appreciable amount of information toward each end is lost by the conventional procedure as in curve 3, where NFT = 21, ITERATION = 1. Curve 4, generated by the new procedure, recovers most of the edge point information. Curve 5 represents the original data collected by multiscalar.

of the polynomial (a,'s), it is necessary to invert the matrix C followed by its multiplication with the matrix B. That is

A = C-lB

(4)

In the present analysis o,is considered to be equal to l / y , because of the statistical fluctuations in collections of a finite number of counts (by the multiscalar) in a certain interval of time (8). In the curve-smoothing procedure a segment consisting of a few data points is taken beginning at N = NBEGIN, the index of the very first data point from which the polynomial fitting starts. As mentioned above, by inversion of matrix C and multiplication by matrix B, one gets the coefficients (ao,al, etc.) of the polynomial used in fitting the segment. At the midpoint of the segment a value of y is calculated by using eq 1,which represents the first smoothed value. Then N as well as NPT are increased by 1. A new polynomial function is obtained by fitting the data points between the new values of N and NPT followed by inversion and multiplication of matrices as before. This would provide the second smoothed value. This procedure is continued until NPT becomes equal to NEND, the index of the very last point used in smoothing. The whole procedure can be repeated several times (iterations) by using the smoothed values (y) from the previous fits to obtain greater smoothing. In this procedure of smoothing, a few points at the beginning and a few points at the end are not smoothed. For a five-, seven-, or nine-point segment (NFT = 5, 7 , or 9) only two, three, or four points at each end (this will be termed edge points) are left out and not smoothed. However, for NFT = 19 or 21,9 or 10 points at each end are not smoothed and hence would involve a significant loss of information. In smoothing and differentiation of data points with large errors, a relatively large NFT value (19 or 21) and one iteration is favored over a small NFT (5 or 7 ) and a larger number of iterations for two reasons. First, a larger NFT provides better smoothing. This has been shown in Figure 1. Curve 5 represents the raw data collected by a 100-channel multiscalar (Honeywell SAL42 A) for Hz+ions scattered at a certain angle (OO). Curve 1 represents the smoothed values obtained by

656

ANALYTICAL CHEMISTRY, VOL. 59, NO. 4, FEBRUARY 15, 1987 I

I

1

i-

I-

i

?

-i I

I

I

0

IO

20

30

40

50

J;

ENERGY l K C A L / M O L l

Figure 2. Differentiation (or energy distribution) curves obtained from curves 1, 2, and 3 of Figure 1 are shown. These are also separated by arbitrary amounts. Undesirable structures are noticeable for NFT = 5 or 7 in energy distribuUon curves (1’ and 2’). Low- and high-energy distribution values are missing in curve 3’, obtained by conventional procedure.

using five point segments of the original data (curve 5 ) a t a time and repeating the whole procedure 5 times (NFT = 5 and iterations = 5). Curve 2 represents the smoothed values obtained by using NFT = 7 and iterations = 3 and curve 3 represents the smoothed values for NFT = 21 and iterations = 1. l’,2‘, and 3’ in Figure 2 are obtained by differentiating 1 , 2 , and 3 of Figure 1,respectively, and plotting their negative values. This gives the energy distribution of H2+ions with intensity of ions along the ordinate and energy along the abscissa. These curves are separated from each other for clarity of viewing. While in 1, 1’ and 2, 2’ only a few points are missing (which can be ignored); in 3, 3’ a significant amount of information is lost. In order to identify which of the plots 1, 2, 3 (or l’, 2’, 3’) represents the best smoothed values (or best distribution), the energy distributions l’, 2’, and 3’ are compared with the distribution obtained from a different experiment (5, 6), i.e., a time-of-flight (TOF) experiment which was carried out under the same experimental conditions as that of the multichannel analysis. The TOF velocity (or energy) distribution shows only one peak at about 15 kcal/mol without other structures at low or high energies. This clearly indicates that the low and the high energy structures on 1’and 2’ are not real but come from noise. Hence 1 and 2, from which 1’and 2’ are derived, do not represent well smoothed data sets. On the other hand, 3’ represents a distribution which most closely resembles the TOF distribution and hence 3 represents the most acceptable smoothed values among 1, 2, and 3. Therefore, a large NFT and 1 iteration does a better smoothing job than a small NFT and a larger number of iterations. One can explain these findings by considering the fact that the noise is reduced approximately as the square root of the number of points used at a time (1). Thesecond reason for which a larger NFT and 1 iteration is favored is the CPU time required for various computations. More iterations involve a larger number of matrix inversion and matrix multiplication steps and hence a larger CPU time. For comparison, the ratio of CPU time for NFT = 5 , ITERATION = 5 to NFT = 21 and ITERATION = 1 is about 3:2 when polynomials of the same order are used. In choosing a large value of NFT, one should be very careful not to lose finer structures which may be present together with more

0

10

20 30 40 ENERGY (KCAL/MOL)

50

Figure 3. Effects of different polynomial orders on the differentiation (distribution) values are presented. All such curves are obtained by the new procedure of smoothing and differentiation using NFT = 2 1, ITERATION = 1. Curve 4’ may be compared with 3‘ of Figure 2. In curve 4‘, most of the low- as well as the high-energy information is recovered. Most of the curves presented in Figures 1 through 3, except 6’ (straight line), 7’ (cubical), and 8‘ (quartic), are generated by the parabolic fit.

pronounced structures. This undoubtedly limits the optimum value of NFT. I t has been noticed that any structure arising in differentiation curves (here energy distribution curves in Figures 2 and 3) from noise is fairly sensitive to even a small increment of NFT. Hence, by gradually increasing NFT one can come up with the best value of NFT to get the best smoothing of the data sets. In the analysis of the kind of data presented here, NFT = 19 or 21 and ITERATION = 1 has been found to give the best smoothing (5, 7). In order to obtain a quantitative estimate of errors involved in the smoothing and differentiating procedure, smoothed data sets were generated by taking points from a parabolic function. Smoothing and differentiation of these data by NFT = 21, ITERATION = 1 provided slope values which were then compared with the true values. This provided a systematic error, which was found to be less than 2% in every case. Then errors were superimposed on the original data sets and smoothing and differentiation were done as before. The slope values, thus obtained, were compared with the true values. This provided errors in slope values due to errors on the data sets and systematic error involved in the differentiation procedure. When the superimposed error on data was between 10% and 12%, the error in slope was about 5% and when error on data was fairly large 40% to 60%, the error in slope ranged from 15% to 20%. The same tests were carried out with a smaller NFT value (NFT = 7, ITERATIONS = 3). For a small error (less than 4%) on the data it provided an excellent result (error in slope