Signal Smoothing with PLS Regression - Analytical Chemistry (ACS

Publication Date (Web): April 5, 2018
Article Cite This: Anal. Chem. 2018, 90, 5959−5964

pubs.acs.org/ac

Signal Smoothing with PLS Regression

Vitaly Panchuk,†,‡,§ Valentin Semenov,†,§ Andrey Legin,†,‡ and Dmitry Kirsanov*,†,‡

† Institute of Chemistry, St. Petersburg State University, St. Petersburg, Russia 199034
‡ Laboratory of Artificial Sensory Systems, ITMO University, St. Petersburg, Russia 197101
§ Institute for Analytical Instrumentation RAS, St. Petersburg, Russia 198095

Supporting Information

ABSTRACT: Smoothing of instrumental signals is an important prerequisite in data processing. Various smoothing methods have been suggested over the last decades, each having its own benefits and drawbacks. Most filtering methods are based on averaging in a certain window (e.g., Savitzky-Golay) or on frequency-domain representation (e.g., Fourier filtering). The present study introduces a novel approach to signal filtering based on signal variance through PLS (projections on latent structures) regression. The influence of filtering parameters on the smoothed spectrum is explained, and real-world examples are shown.

Signal smoothing is one of the most important steps in data preprocessing. The purpose of smoothing is to improve the quality of the obtained data in order to achieve better precision in qualitative and quantitative analysis. The number of smoothing methods is quite large and growing. This is because all methods distort the signal parameters to some extent, and the greater the signal-to-noise improvement, the more distorted the smoothed line is. Depending on the analytical task, various requirements can be imposed on the smoothing method. In the case of qualitative data analysis, it is important to preserve the position of the signal and its width; otherwise incorrect identification of the analyte may occur. When quantitative analysis is in question, the signal intensity and the area under the peak are crucial for accurate quantification of the target substance.

Smoothing methods can be based on various principles. A number of methods use signal averaging in a certain spectral window (the group of moving average and median filters,1 Savitzky-Golay,2 etc.). There are methods based on frequency-domain representation (Fourier1 and Wiener filtering,3 wavelet,1 etc.). Another group of methods approximates the signal with an appropriate mathematical function (e.g., penalized least-squares,4 etc.). Signal smoothing (noise filtering) is widely applied to almost any type of analytical data: in chromatography,5,6 in molecular spectroscopy,7,8 in atomic spectroscopy,9,10 etc. It is noteworthy that smoothing procedures can be applied not only to the raw analytical signals but also as an intermediate step in chemometric modeling, e.g., for smoothing of loading weight vectors in multivariate regression11 to improve model performance, for smoothing of regression coefficients, and12 for multivariate calibration transfer.

Although signal smoothing is successfully used in various analytical domains, it should be handled with care when multivariate data processing is intended, since it may introduce correlated structure and signal distortion, leading to the deterioration of model quality.13 Partial least-squares (PLS) regression is one of the most popular tools for multivariate calibration in chemometrics.14 It is typically used to relate the response of multichannel analytical instruments to certain features of samples, such as concentrations of analytes or integral quality parameters. The mathematics behind PLS is based on analysis of the variance structure in the data and as such provides a way of sorting the signals according to their contribution to the total data variance. The analytical signals will typically have higher variance than the baseline, and this feature can be employed for data smoothing. The present study is devoted to the exploration of the potential of PLS for signal smoothing and addresses several real cases to demonstrate the feasibility of the approach. The paper is organized as follows: in the Theory section the idea and the mathematics of the method are explained; then the method is applied to qualitative analysis in Mössbauer spectrometry; and finally its applicability to quantitative analysis in X-ray fluorescence spectrometry is demonstrated.

© 2018 American Chemical Society

Received: March 16, 2018 Accepted: April 5, 2018 Published: April 5, 2018

DOI: 10.1021/acs.analchem.8b01194 Anal. Chem. 2018, 90, 5959−5964




THEORY

PLS regression is well documented in the chemometric literature, and a detailed description can be found elsewhere.14 This method is based on decomposition of both the independent predictors (analytical signals collected from a calibration sample set) and the target parameters (e.g., concentrations of target analytes in the calibration sample set), matrices X and Y respectively, into a new latent variable (LV) space. These LVs are drawn along the directions of variance in the data. The B vector (vector of regression coefficients) converts independent predictors into target parameters, and it depends on the number of LVs. The first LVs are connected with the largest sources of variance (correlated with Y), and the following LVs are associated with smaller contributions. In the case of spectral data, typical measurement noise can be considered as a source of minor variance compared to that of meaningful spectral bands. Thus, by varying the number of LVs in the calculation of B, one can separate spectral noise from the signal. The suggested PLS smoother approach is illustrated in Figure 1.

Figure 1. PLS smoother concept.

In this figure vector Y is the raw spectrum which has to be filtered. The matrix X is composed of single pure model line signals: each row of X contains only one line, and the position of this line changes through the matrix. Thus, matrix X is the basis used to decompose Y into single lines. The weight of each single line is determined by the corresponding regression coefficient from the vector B_LV. Here vector B_LV is found through the PLS regression procedure. The product of X and B_LV yields the smoothed spectrum. Several points at the beginning and at the end of the smoothed spectrum can be distorted, as the first and the last lines in the decomposition basis X contain only parts of the model line shape. B_LV (and consequently the smoothed spectrum) depends on the number of LVs and on the shape and the parameters of the single model lines employed in X.

The choice of particular parameters, including the LV number, depends on the particular preprocessing task. If one is aiming at complete noise suppression, then a small LV number is recommended. If the purpose is the exact reconstruction of a peak shape, then a higher LV number should be employed. The various validation methods employed for determination of the optimal LV number in PLS models do not seem capable of providing useful information here, since the X matrix is composed of individual model lines positioned along the wavelength scale. In this case validation will result in prediction of a particular spectral point using the data without its model line.

The particular shape of the model peak also depends on the task in mind. The narrower it is, the less noise will be removed in the smoothed data. At the same time, narrow model lines will provide unaltered spectral line shapes in the smoothed data. The broader the model line, the greater the noise suppression (at the price of spectral line distortion). As an initial estimate, the model line width can be set equal to the real width of the spectral line. The amplitude of the model line peak influences only the scale of the regression coefficients.
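The smoother concept of Figure 1 can be sketched in a few lines of Python with NumPy. This is only an illustrative sketch, not the authors' C# program: the function names (`gaussian_basis`, `pls1_coefficients`, `pls_smooth`) are hypothetical, the model line is assumed to be a unit-height Gaussian whose `width` is read as the full width at half-maximum, and the PLS1 regression is run without mean-centering so that the smoothed spectrum is simply the product of X and the coefficient vector.

```python
import numpy as np


def gaussian_basis(n_channels, width):
    """Return X: row i is a unit-height Gaussian model line centred on channel i.

    `width` is interpreted as the full width at half-maximum (an assumption).
    """
    ch = np.arange(n_channels, dtype=float)
    offset = ch[None, :] - ch[:, None]          # element [i, j] = j - i
    return np.exp(-4.0 * np.log(2.0) * (offset / width) ** 2)


def pls1_coefficients(X, y, n_lv):
    """PLS1 regression vector via NIPALS-style deflation (no mean-centring)."""
    Xk = np.array(X, dtype=float)               # working copies, deflated in place
    yk = np.array(y, dtype=float)
    n_vars = Xk.shape[1]
    W = np.zeros((n_vars, n_lv))                # weights
    P = np.zeros((n_vars, n_lv))                # X loadings
    q = np.zeros(n_lv)                          # y loadings
    for a in range(n_lv):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)
        t = Xk @ w                              # scores of latent variable a
        tt = t @ t
        p = Xk.T @ t / tt
        q[a] = (yk @ t) / tt
        Xk -= np.outer(t, p)                    # remove the explained part
        yk -= q[a] * t
        W[:, a], P[:, a] = w, p
    return W @ np.linalg.solve(P.T @ W, q)      # B = W (P^T W)^-1 q


def pls_smooth(y, width, n_lv):
    """Return (smoothed spectrum, regression coefficients) for raw spectrum y."""
    X = gaussian_basis(len(y), width)
    b = pls1_coefficients(X, y, n_lv)
    return X @ b, b


# Synthetic example (made-up data): one peak plus white noise.
rng = np.random.default_rng(0)
channels = np.arange(200)
clean = np.exp(-4.0 * np.log(2.0) * ((channels - 100) / 6.0) ** 2)
raw = clean + rng.normal(scale=0.1, size=channels.size)
smoothed, b = pls_smooth(raw, width=4.0, n_lv=3)
```

Raising `n_lv` lets the fit follow the raw spectrum more closely, while lowering it suppresses more noise, which mirrors the trade-off described in the text. A Lorentzian or any other model line could be substituted in `gaussian_basis`.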

Figure 2. Mössbauer spectra of 6 μm α-Fe foil, magnetite (Fe3O4), and iron containing ore acquired with different SNR.




EXPERIMENTAL SECTION

The suggested approach was tested with Mössbauer and X-ray fluorescence (XRF) data. In the first case we studied the influence of the parameters (number of LVs and model line shape) on the resulting smoothed spectrum. In the second case the applicability of the PLS filter for improving a quantitative calibration model is demonstrated and compared with that of the Savitzky-Golay filter. In order to implement the proposed smoothing procedure, dedicated program code was written in C# using the standard NIPALS (nonlinear iterative partial least-squares) algorithm.14

Mössbauer Data. Mössbauer spectra were chosen due to the following considerations: (1) the possibility of acquiring spectra with a predefined signal-to-noise ratio (SNR) by varying the spectra accumulation time and (2) the known line shape (Lorentzian peak). These features make it possible to estimate the distortion of the line parameters in the smoothed spectra. Mössbauer spectra were acquired with a WissEl (Wissenschaftliche Elektronik GmbH) spectrometer in constant acceleration mode with a ⁵⁷Co(Rh) Mössbauer source at room temperature. The spectra were processed by fitting them with a set of single Lorentzian peaks using the Levenberg−Marquardt algorithm. As a result the following peak parameters were extracted: amplitude (in relative units), width, and position (both in number of channels). The measured samples were 6 μm α-Fe foil, magnetite (Fe3O4), and iron-containing ore. The sample amount did not exceed 10 mg/cm² in order to avoid spectral shape distortion due to saturation effects. Various spectra accumulation times were employed to obtain various SNR values for the least intense peak (3, 8, 14, 27, >30). Figure 2 shows Mössbauer spectra of the three samples acquired with different accumulation times and thus having different SNR: 3, 8, and >30. The spectra with SNR > 30 were employed for estimation of the parameters of the least intense reference line. These were line amplitude, line width, and line position. Table 1 shows the calculated values of these “ideal” parameters.
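For reference, the two single-line shapes used in this work (Lorentzian and Gaussian) can be written as short Python functions of the three fitted parameters: amplitude, width, and position. This is a minimal sketch under the assumption that "width" means the full width at half-maximum; the text does not state the exact width convention, and the function names are hypothetical.

```python
import math


def lorentzian(x, center, width, amplitude=1.0):
    """Lorentzian line; `width` is taken as the full width at half-maximum."""
    return amplitude / (1.0 + (2.0 * (x - center) / width) ** 2)


def gaussian(x, center, width, amplitude=1.0):
    """Gaussian line with the same full width at half-maximum convention."""
    return amplitude * math.exp(-4.0 * math.log(2.0) * ((x - center) / width) ** 2)


# Both shapes fall to half of the amplitude at center +/- width / 2,
# but the Lorentzian keeps much heavier tails far from the center.
center, width = 185.0, 4.8
for offset in (0.0, width / 2.0, 2.0 * width):
    x = center + offset
    print(offset, lorentzian(x, center, width), gaussian(x, center, width))
```

With a common width, the Gaussian decays much faster away from the line center, so a Gaussian model line spreads less intensity into neighboring channels than a Lorentzian one.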

the individual lanthanide spectra besides cerium was observed. Thus, the direct quantification of particular individual lanthanides from these data was cumbersome. In this study PLS smoothing was applied to the spectra, and after that the concentration of cerium was considered as the target parameter for quantification. The Supporting Information for this report contains an MS Excel file with several Mössbauer spectra and the whole EDX data set together with their decomposition matrices to encourage readers to study the filter performance in detail.



RESULTS AND DISCUSSION

Mössbauer Spectroscopy Data. The main parameters of the PLS smoothing procedure that determine the quality of the final spectrum are the number of LVs in the PLS decomposition and the shape of the basis line (single pure model line in X). To illustrate this, let us consider Mössbauer spectra of α-Fe (SNR = 3) smoothed with various numbers of LVs and various parameters of the basis line (Figure 3). The basis line was a normalized Lorentzian line; thus the only parameter to tune is the Lorentzian width. When the number of LVs is 1, the amplitudes are much lower than they should be and the spectral widths are much broader, while SNR is the highest compared to that with other numbers of LVs. With an increase of the LV number (LV = 2, 5) the distortion of spectral parameters is lower, but the quality of smoothing (noise suppression) decreases. With LV = 20 the smoothed spectrum is almost equivalent to the raw one. This tendency is due to the fact that the first several PLS components are connected with the highest variance in the data (normally associated with the analytical signal), while higher LVs are connected with minor variance mainly associated with spectral noise. Obviously, the first LV cannot accommodate all the peculiarities of the analytical signal; thus signal shape distortion is observed. The opposite trend can be seen for the basis line width: the larger it is, the more distorted the smoothed signal is. The narrowest basis line provided a smoothed spectrum that is almost identical to the raw one. Thus, each particular task requires a careful choice of PLS smoothing parameters.

In Mössbauer spectroscopy it is important to keep spectral line parameters (amplitude, width, and position) unaltered during all data modifications in order to allow unbiased conclusions on sample composition. The PLS smoothing parameters were optimized to yield minimal distortion of amplitude, width, and position values while keeping the maximal SNR. As a case study, the Mössbauer spectrum of α-Fe with SNR = 8 was considered. Lorentzian and Gaussian basis line shapes were applied for PLS smoothing. The line position was not altered regardless of the filtering parameters, while both amplitude and width depend strongly on these parameters. With a single LV a significant broadening of the filtered line width and suppression of the line amplitude were observed. These effects are more pronounced when the basis line width is larger, and they reach a maximum when the basis line is broader than the spectral line (in this case the real width is 4.77, Table 1). A Gaussian basis line causes less broadening than a Lorentzian one. This is due to the fact that the Gaussian line is sharper and fits the spectrum better. Taking more LVs into account leads to smaller distortion of line parameters. In the case of the Gaussian line shape with widths of 2.5 and 4, LV = 3 already provides distortion below 10% and an SNR twice as good as the initial one. In the case of the Lorentzian basis, the same performance can be observed only at a higher number

Table 1. Reference Parameters of Mössbauer Spectral Line

sample   amplitude, A (r.u.)   width, w (channel)   position, IS (channel)
α-Fe     0.0236 ± 0.0002       4.776 ± 0.005        185.184 ± 0.002
Fe3O4    0.039 ± 0.0001        5.628 ± 0.004        156.536 ± 0.003
Ore      0.1338 ± 0.0104       26.843 ± 2.328       230.286 ± 0.834

The α-Fe Mössbauer spectrum consists of six freestanding nonoverlapped lines (sextet). The Fe3O4 spectrum is a superposition of two sextets, and all lines are more or less overlapped. The iron-containing ore spectrum is a superposition of three strongly overlapped doublets. The peak width in the case of the ore is larger than that for the two other cases (Table 1).

XRF Data. To demonstrate the potential of PLS smoothing in quantitative analysis, noisy data from 40 lanthanide mixtures obtained with an energy-dispersive X-ray fluorescence spectrometer (Shimadzu EDX-800HS) were employed. The mixtures contained six lanthanides: Ce, Pr, Nd, Sm, Eu, Gd. Lanthanide concentrations in the mixtures were varied in the range from 10⁻⁶ to 10⁻³ mol/L. The data set was taken from an earlier study.15 The lanthanide concentrations were quite low, and SNR was rather low too. Moreover, significant overlap of


Figure 3. Mössbauer spectra of α-Fe (SNR = 3) smoothed with various numbers of LVs (width = 5 channels) and various widths of the Lorentzian basis line (LV = 2).

Table 2. SNR Values of Mössbauer Spectra after PLS Smoothing

              SNR after smoothing
initial SNR   α-Fe   Fe3O4   Ore
3             5      8       6
8             17     14      19
14            23     25      29
27            52     60      55

Figure 5. Difference spectra between raw data and 1, Savitzky-Golay, window = 5; 2, FFT filter (“low pass”, cutoff frequency = 8 Hz); 3, PenLS filter, λ = 2; 4, PLS filter.

of LVs and small basis line width, while SNR is close to the initial value. Table S1 (Supporting Information) contains the detailed results of this part of the study. The optimized parameters of the filter (Gaussian width = 4, LV = 3) were applied for smoothing of all acquired Mössbauer spectra except for iron containing ore (Gaussian width = 20, LV = 3). The SNR values for filtered data are given in Table 2. Modification of regression coefficients B derived from PLS modeling allows for additional denoising. As an example Figure

Figure 4. EDX spectra of lanthanide mixtures: 1, raw data; 2, smoothed with Savitzky-Golay (window width 5); 3, smoothed with PLS filter.


Figure 6. OLS models for determination of cerium (a) in the whole concentration range and (b) in the low concentration range.

Table 3. r-Pearson and RSS Values for Calibration Plots Obtained with Raw and Smoothed Data

            raw          SG5          FFT          PenLS        PLS
Whole Concentration Range
r-Pearson   0.96         0.96         0.98         0.97         0.98
RSS         6.7 × 10⁻⁶   5.0 × 10⁻⁶   2.8 × 10⁻⁶   3.2 × 10⁻⁶   3.1 × 10⁻⁶
Low Concentration Range
r-Pearson   0.38         0.25         0.38         0.29         0.39
RSS         4.3 × 10⁻⁶   3.9 × 10⁻⁶   2.3 × 10⁻⁶   2.6 × 10⁻⁶   2.4 × 10⁻⁶

Difference spectra show that the SG filter with window = 5 induces line shape distortion, and a certain spectral structure can be seen. FFT, PenLS, and PLS do not distort the spectra significantly; however, in the case of FFT a periodic structure appears at the ends of the spectra (Figure S3). The application of PenLS and PLS smoothing yields rather uniform difference spectra with small amplitude.

The smoothed and raw spectra were employed for quantitative determination of cerium. The intensity of the Lα line of cerium (4.84 keV) was related to the cerium concentration using ordinary least-squares (OLS) regression in two concentration ranges separately. The first model covered the whole studied concentration range of cerium (10⁻⁶ to 10⁻³ mol/L). The second model addressed only the low concentration range (10⁻⁶ to 1.7 × 10⁻⁴ mol/L) at the detection limit of the instrument. Figure 6 shows the calibration plots for these two ranges derived with SG and PLS. Calibration lines in the case of the PLS filter lie above those for the SG filter due to the intensity suppression in the latter case. Figure S4 shows calibration plots for all studied filters in the whole concentration range. The FFT calibration line is the closest one to that produced with the raw data. The largest intensity suppression can be observed for the SG filter, while PenLS and PLS yield similar results. However, in the case of PenLS a certain deviation in the slope of the line can be observed (Figure S4).

The general quality of fit was estimated using the Pearson correlation coefficient (r-Pearson) and the residual sum of squares (RSS). These parameters for the studied filters are given in Table 3. When the OLS model was constructed for the whole concentration range, all studied filters improved the calibration model quality. The smallest improvement in RSS was observed for SG filtering, the largest one for FFT. The results of PenLS and PLS were similar. RSS improvement in the case of low concentrations followed the same trend. r-Pearson values for low concentrations varied significantly between the filters; however, the correlation itself is quite low since the determined Ce concentrations in this case are near the

S1 (Supporting Information) shows the PLS-smoothed spectra of α-Fe where the regression coefficients for spectral variables which are not attributed to spectral lines (with values below the noise value) were all set to zero. Obviously, this type of modification leads to a certain distortion of the peak amplitude and peak width, but the peak position (which is important for qualitative analysis) remains unaltered. The regression coefficients themselves can also be used as a smoothed spectrum. While the noise suppression effect is not clearly visible in this case, the spectral resolution can be improved by using more latent variables in the model. This approach was applied for resolution improvement in the Fe3O4 and ore spectra. The optimized filter parameters were employed, but the number of LVs was set at 6 in order to keep the noise low. Spectral resolution was estimated as the ratio of the distance between two neighboring spectral lines to the line width. In the case of Fe3O4, the resolution improvement was above 20%, and for the iron-containing ore it was above 40%. Figure S2 (Supporting Information) illustrates this feature for the Fe3O4 spectrum.

EDX Spectrometry Data. As an illustration of the potential of PLS smoothing in quantitative analysis, we addressed the EDX data from lanthanide mixtures where cerium content was determined. For comparison purposes the most widely applied filtration methods were also employed: Savitzky-Golay (SG), fast Fourier transform (FFT), and penalized least-squares (PenLS). The natural spectral line width for EDX data is about eight channels. In the case of the PLS filter, the following parameters were found to be optimal: Gaussian basis line width = 5 channels and LV = 3. In the case of Savitzky-Golay, a second-order polynomial function with two window widths (5 and 11) was tested. The FFT filter parameters were “low pass” with cutoff frequency = 8 Hz; penalized least-squares4 was used with λ = 2. Figure 4 shows the raw and the smoothed spectra for SG and PLS filtration; the latter provides visually better results. Other filters are shown in Figure S3. Figure 5 shows the difference spectra between the filtered and raw data.
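The two post-processing steps described for the Mössbauer spectra above (zeroing regression coefficients that lie below the noise value, and estimating resolution as line distance divided by line width) can be sketched as follows. The function names and all numbers are hypothetical, and comparing the magnitude of each coefficient with a single scalar noise level is an assumption, since the text does not specify how the noise value was determined.

```python
def zero_below_noise(coefficients, noise_level):
    """Set to zero every coefficient whose magnitude does not exceed noise_level."""
    return [b if abs(b) > noise_level else 0.0 for b in coefficients]


def resolution(position_a, position_b, line_width):
    """Ratio of the distance between two neighbouring lines to the line width."""
    return abs(position_b - position_a) / line_width


# Hypothetical numbers, for illustration only.
b = [0.01, -0.02, 0.50, 0.03, 0.40, -0.01]
print(zero_below_noise(b, noise_level=0.05))

before = resolution(150.0, 160.0, line_width=5.0)
after = resolution(150.0, 160.0, line_width=4.0)
print(f"resolution improvement: {100.0 * (after / before - 1.0):.0f}%")
```

In this toy case the line positions stay fixed and only the width shrinks, so the resolution gain comes entirely from line narrowing.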


detection limit of the method. The PLS filter yielded an RSS value comparable to that of FFT and the highest r-Pearson.
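The two figures of merit reported in Table 3 can be computed from a univariate OLS calibration as in the following sketch. It assumes that line intensity is regressed on concentration (the text does not state which variable was treated as the response), the function name `ols_calibration` is hypothetical, and the numbers are made up for illustration.

```python
import math


def ols_calibration(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r_pearson, rss).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares of the fitted calibration line.
    rss = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    r_pearson = sxy / math.sqrt(sxx * syy)
    return slope, intercept, r_pearson, rss


# Made-up calibration points: concentration (x) versus line intensity (y).
concentration = [1.0, 2.0, 3.0, 4.0]
intensity = [1.0, 3.0, 2.0, 4.0]
slope, intercept, r, rss = ols_calibration(concentration, intensity)
print(slope, intercept, r, rss)
```

Running the same function on raw and on smoothed intensities gives directly comparable r-Pearson and RSS values for each filter.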

(4) Eilers, P. H. C. Anal. Chem. 2003, 75, 3631−3636.
(5) Vivo-Truyols, G.; Torres-Lapasio, J. R.; van Nederkassel, A. M.; Vander Heyden, Y.; Massart, D. L. J. Chromatogr. A 2005, 1096, 133−145.
(6) Fu, H. Y.; Guo, X. M.; Zhang, Y. M.; Song, J. J.; Zheng, Q. X.; Liu, P. P.; Lu, P.; Chen, Q. S.; Yu, Y. J.; She, Y. Anal. Chem. 2017, 89, 11083−11090.
(7) Douglas, R. K.; Nawar, S.; Alamar, M. C.; Mouazen, A. M.; Coulon, F. Sci. Total Environ. 2018, 616, 147−155.
(8) Byrne, H. J.; Knief, P.; Keating, M. E.; Bonnier, F. Chem. Soc. Rev. 2016, 45, 1865−1878.
(9) Mir-Marqués, A.; Garrigues, S.; Cervera, M. L.; de la Guardia, M. Microchem. J. 2014, 117, 156−163.
(10) Leani, J. J.; Sánchez, H. J.; Valentinuzzi, M. C.; Pérez, C.; Grenón, M. C. J. Microsc. 2013, 250, 111−115.
(11) Rutledge, D.; Barros, A.; Delgadillo, I. Anal. Chim. Acta 2001, 446, 279−294.
(12) Camacho, J.; Lennox, B.; Escabias, M.; Valderrama, M. J. Chemom. 2015, 29, 338−348.
(13) Brown, C. D.; Wentzell, P. D. J. Chemom. 1999, 13, 133−152.
(14) Wold, S.; Sjöström, M.; Eriksson, L. Chemom. Intell. Lab. Syst. 2001, 58, 109−130.
(15) Kirsanov, D.; Panchuk, V.; Goydenko, A.; Khaydukova, M.; Semenov, V.; Legin, A. Spectrochim. Acta, Part B 2015, 113, 126−131.



CONCLUSION

A procedure for signal smoothing based on PLS regression has been suggested. The overall idea behind this type of filtering is to sort the signals according to the variance they hold. It is shown that PLS smoothing allows significant signal-to-noise ratio improvement without serious distortion of line parameters (width, position, amplitude). Moreover, PLS smoothing allows spectral resolution improvement. The examples from Mössbauer spectrometry illustrate the process of selecting the filtering parameters and show that the suggested procedure can be successfully applied for spectral preprocessing. The features of PLS smoothing in quantitative analysis were demonstrated with EDX spectrometry data and were compared with other popular smoothers. The basis line employed for PLS decomposition can be of any complexity (e.g., asymmetrical or multiplet); thus the smoother performance can be adapted to a wide variety of real-world applications.



ASSOCIATED CONTENT

Supporting Information

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.analchem.8b01194.

Deviations (%) in amplitude (ΔA), line width (Δw), and line position (ΔIS, isomer shift) with different PLS filter parameters; regression coefficients and derived spectra with and without setting around-noise B to zero; regression coefficients and raw signal for Fe3O4 spectral fragment; EDX spectra of lanthanide mixtures; and OLS models for determination of cerium obtained with raw data and different filters (PDF)

Mössbauer spectrum of α-Fe with SNR = 3, smoothed spectrum with LV = 2, and decomposition basis with Lorentzian lines, w = 5; Mössbauer spectrum of α-Fe with SNR = 8, smoothed spectrum with LV = 3, and decomposition basis with Gaussian lines, w = 4; EDX spectra of 40 samples (aqueous solutions of lanthanide mixtures); EDX spectra of 40 samples smoothed with Gaussian basis, w = 5, LV = 3; and Gaussian decomposition basis (w = 5) for EDX data (XLSX)



AUTHOR INFORMATION

Corresponding Author

*E-mail: [email protected].

Notes

The authors declare no competing financial interest.



ACKNOWLEDGMENTS

This work was partially financially supported by the Government of the Russian Federation, Grant 074-U01. V.P., A.L., and D.K. acknowledge partial financial support from St. Petersburg State University Project No. 12.37.216.2016.



REFERENCES

(1) Brereton, R. G. Chemometrics. Data Analysis for the Laboratory and Chemical Plant; John Wiley & Sons: Chichester, England, 2003.
(2) Savitzky, A.; Golay, M. J. E. Anal. Chem. 1964, 36, 1627−1639.
(3) Brown, R. G.; Hwang, P. Y. C. Introduction to Random Signals and Applied Kalman Filtering, 3rd ed.; John Wiley & Sons: New York, 1996.
