Anal. Chem. 2005, 77, 7998-8007
Sign Constraints Improve the Detection of Differences between Complex Spectral Data Sets: LC-IR As an Example
Hans F. M. Boelens,*,† Paul H. C. Eilers,‡ and Thomas Hankemeier§
Biosystems Data Analysis Group, Swammerdam Institute of Life Sciences, FNWI, Universiteit van Amsterdam, Nieuwe Achtergracht 166, 1018 WV Amsterdam, The Netherlands, Department of Medical Statistics, Leiden University Medical Centre, P.O. Box 9604, 2300 RC Leiden, The Netherlands, and Packaging Research and Polymer Analysis Group, Analytical Sciences Department, TNO Quality of Life, Utrechtseweg 48, 3704 HE Zeist, The Netherlands
Spectroscopy is a fast and rich analytical tool. On many occasions, spectra are acquired of two or more sets of samples that differ only slightly. These data sets then need to be compared and analyzed, but sometimes it is difficult to find the differences. We present a simple and effective method that detects and extracts new spectral features in a spectrum coming from one set with respect to spectra of another set on the basis of the fact that these new spectral features are essentially positive quantities. The proposed procedure (i) characterizes the spectra of the reference set by a component model and (ii) uses asymmetric least squares (ASLS) to find differences with respect to this component model. It should be stressed that the method only focuses on new features and does not trace relative changes of spectral features that occur in both sets of spectra. A comparison is made with the conventional ordinary least squares (OLS) approach. Both methods (OLS and ASLS) are illustrated with simulations and are tested for size-exclusion chromatography with infrared detection (SEC-IR) of mixtures of polymer standards. Both methods are able to provide information about new spectral features. It is shown that the ASLS-based procedure yields the best recovery of new features in the simulations and in the SEC-IR experiments. Band positions and band shapes of new spectral features are better retrieved with the ASLS than with the OLS method, even those which could hardly be detected visually. Depending on the spectroscopic technique used, the ASLS-based method facilitates identification of the new chemical compounds.

* To whom correspondence should be addressed. Phone: +31 20 525 6547. Fax: +31 20 525 6971.
† Universiteit van Amsterdam.
‡ Leiden University Medical Centre.
§ Current address: Analytical Biosciences, Leiden/Amsterdam Drug Research Center, FWN, Leiden University, Einsteinweg 55, 2333 CC Leiden, The Netherlands.
10.1021/ac051370e CCC: $30.25 © 2005 American Chemical Society. Published on Web 11/05/2005.
7998 Analytical Chemistry, Vol. 77, No. 24, December 15, 2005

Spectroscopic sensors are powerful because they are fast and information-rich and do not require much sample preprocessing. On many occasions, spectra are acquired of two or more sets of samples that differ slightly. These spectral data sets must be compared. The final goal of such a differential analysis of spectral sets is to find and preferably to quantify the chemical differences between sets of samples. An example is the comparison of industrial batch reactions (pharmaceutical or fermentation batches) that are monitored with spectroscopy.1,2 Batches may be different because of small amounts of undesired side products. Detecting and identifying these side products by searching and screening the differences between the collected spectroscopic data sets is important. Vibrational spectroscopy, such as infrared and Raman spectroscopy, can be used for the characterization of differences between biological samples.3-8 More complex samples are often better studied in detail by hyphenation of a separation method, for example, liquid chromatography (LC), and spectroscopic detection, such as IR,9,10 Raman,11,12 NMR, or MS.

Various chemometric techniques are available to find differences between spectral data sets. Pattern recognition techniques13 are used to find differences between groups of samples. Supervised pattern recognition techniques could be used to do this. Pattern recognition techniques, however, run into problems when only a small number of samples are available and especially when only a few test set samples are available. Moreover, the resulting discriminant function of these techniques is not a spectrum. Such a function is, thus, not directly interpretable.

(1) Workman, J.; Koch, M.; Veltkamp, D. J. Anal. Chem. 2005, 77, 3789-3806.
(2) Miller, C. E. J. Chemom. 2000, 14, 513-528.
(3) Nijssen, A.; Schut, T. C. B.; Heule, F.; Caspers, P. J.; Hayes, D. P.; Neumann, M. H. A.; Puppels, G. J. J. Invest. Dermatol. 2002, 119, 64-69.
(4) Mossoba, M. M.; Khambaty, F. M.; Fry, F. S. Appl. Spectrosc. 2002, 56, 732-736.
(5) Yang, H.; Irudayaraj, J. J. Mol. Struct. 2003, 646, 35-43.
(6) Stone, N.; Kendall, C.; Smith, J.; Crow, P.; Barr, H. Faraday Discuss. 2004, 126, 141-157.
(7) Jarvis, R. M.; Brooker, A.; Goodacre, R. Anal. Chem. 2004, 76, 5198-5202.
(8) Lopez-Diez, E. C.; Goodacre, R. Anal. Chem. 2004, 76, 585-591.
(9) Vonach, R.; Lendl, B.; Kellner, R. J. Chromatogr., A 1998, 824, 159-167.
(10) Kok, S. J. Ph.D. Thesis, University of Amsterdam, 2004.
(11) Holtz, M.; Dasgupta, P. K.; Zhang, G. F. Anal. Chem. 1999, 71, 2934-2938.
(12) Steinert, R.; Bettermann, H.; Kleinermanns, K. Appl. Spectrosc. 1997, 51, 1644-1647.
(13) Hopke, P. K. Anal. Chim. Acta 2003, 500, 365-377.

Adaptive Kalman filtering (AKF)14 is also used to find new spectral phenomena. It has been used for resolving two overlapping compounds in HPLC-DAD.15 The UV spectrum of one of the compounds (interfering agent) was not known. The size and the shape of the innovation sequence of the Kalman filter were used to detect the presence of such an interfering agent. In addition, identification of this compound was performed. Use of the innovation sequence of the Kalman filter is similar to using the residuals of a conventional least squares approach (OLS). The method proposed here is compared with the OLS approach below.

Another tool that could be used is self-modeling curve resolution (SMCR).16,17 The standard way of applying SMCR, however, tries to retrieve the concentration profiles and spectra of all components from one single data set. This is possible when restrictions based on general a priori knowledge, such as nonnegativity or unimodality of profiles, are applied. Finding the differences between similar spectral data sets is not discussed in SMCR literature, although the technique could probably be adapted. Windig18,19 explicitly addresses the topic of retrieving differences from LC/MS and HPLC-DAD measurements on very similar samples. He is, however, interested in finding small differences along the retention time axis and not along the spectroscopic channels. His new methods to find those differences are not related to the method proposed here.

The method proposed here makes essential use of the fact that the spectral features of interest are positive. It is based on an algorithm that is simple and fast to converge. Moreover, it can detect the difference between a single test spectrum and a set of reference spectra. However, it should be emphasized that the focus is on the case in which a sample contains one or more new and hitherto unknown compounds with respect to other samples, so the problem of tracking concentration changes of compounds present in all samples is not addressed. One of the measured spectral sets is taken as the reference set.
Spectra from the other set, the test set, are compared with these reference spectra. The first question that arises is whether a test spectrum is at all different from the reference set. When the spectral differences are large and the spectroscopic technique is rather selective, such as IR and Raman, they can be found by visual inspection. For complex spectra at low signal-to-noise or poor spectral resolution, statistical techniques can help to detect spectral differences. Finding that a statistical difference exists, however, does not indicate which new spectral band-like features are present in the spectrum and at which channels (wavelengths, wavenumbers) they appear. Additionally, the band shape of the new features is revealing. Having all this information and given a suitable spectroscopic technique, it is possible to track new functional groups or ultimately new chemical compounds of the test sample. For example, in vibrational spectroscopy (IR and Raman), investigating which bands are absent and which are present allows identification of chemical compounds.20

(14) Rutan, S. C. Anal. Chem. 1991, 63, 1103A.
(15) Chen, J.; Rutan, S. C. Anal. Chim. Acta 1996, 335, 1-10.
(16) de Juan, A.; Tauler, R. Anal. Chim. Acta 2003, 500, 195-210.
(17) Jiang, J. H.; Ozaki, Y. Appl. Spectrosc. Rev. 2002, 37, 321-345.
(18) Windig, W.; Smith, W. F.; Nichols, W. F. Anal. Chim. Acta 2001, 446, 467-476.
(19) Windig, W.; Marchincin, T. F.; Meyer, G. N. Appl. Spectrosc. 2003, 57, 1575-1584.
(20) Socrates, G. Infrared and Raman Characteristic Group Frequencies; Wiley: New York, 2001.
The proposed procedure (i) characterizes the spectral reference set by a component model and (ii) uses asymmetric least squares (ASLS) to find differences of a test spectrum with respect to this component model. The procedure will be explained and compared with a conventional least squares approach (OLS). Its properties will be discussed in the appendix. The ASLS method will be illustrated with simulations, and its potential is further demonstrated for data obtained by size exclusion chromatography (SEC) with infrared (IR) detection of mixtures of polymers. It will be shown that small differences between the different chromatographic runs that could hardly be detected visually are better revealed by ASLS than by OLS.

THEORY

Basic Approach. It is assumed that all spectra of the reference and test set are measured on the same instrument using the same number of channels (nchan). These channels may be wavelengths, wavenumbers, or, for example, chemical shifts in NMR spectroscopy. The basic approach to the problem is to correct the spectra of the test set for the spectral features that occur in the reference spectra. Subsequently, the residual spectral variation characteristic only for the test set can be analyzed.

The main assumption of the method is that spectra are positive quantities. This is a reasonable assumption, because many types of spectroscopy deliver strictly positive quantities. For example, near-infrared and infrared spectroscopy (transmission and reflectance), UV-vis (transmission), Raman, and NMR spectroscopy yield positive quantities. In practice, the instrumental noise, spectral drift, or other artifacts may blur the actually measured spectral signal and cause it to be (slightly) negative. Noise cannot be removed; it can only be smoothed at the cost of losing spectral resolution. Spectral drift phenomena that cause such negativity, however, may be corrected for by a suitable spectral preprocessing.

Signal Model.
Let the matrix Xr(nchan × nr) contain the spectra of the reference set on its columns, and let xt(nchan × 1) be a spectrum of the test set. The proposed signal model of this spectrum is
$$x_t = P\beta + E \qquad (1)$$
P(nchan × np) is a known matrix that summarizes the spectral features of the reference set. The np columns of the matrix P are the loading vectors of a principal component analysis21,22 and they span the reference spectral space. Actually, using a principal component analysis to summarize the features of the reference set is not essential. Any component model that adequately summarizes the spectral variation of the reference set may be used.

(21) Malinowski, E. R. Factor Analysis in Chemistry; Wiley: New York, 2002; Chapter 4.
(22) Jolliffe, I. Principal Component Analysis; Springer-Verlag: New York, 2002.

It should be pointed out that this signal model (eq 1) is incomplete. The test set spectrum can only be modeled with the knowledge coming from the reference set. Estimation of the coefficient vector β (np × 1) yields the total contribution (viz., Pβ) of the spectral features of the reference set to the measured test spectrum. By straightforward subtraction (eq 2), the residuals (E) contain the new spectral features of interest:

$$E = x_t - P\beta \qquad (2)$$
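The subtraction in eq 2 can be illustrated with a minimal Python sketch. It assumes the simplest possible reference model, a single column p in P, and estimates β by ordinary least squares, one of the two estimators considered in the text. The Gaussian band shapes, widths, heights, and the function names `gaussian_band` and `residual_one_component` are illustrative assumptions; only the band positions (channels 50 and 150) follow the nonoverlapping synthetic example of Figure 2.

```python
import math

def gaussian_band(n_chan, center, width, height=1.0):
    """Hypothetical band shape: a Gaussian sampled on channels 0..n_chan-1."""
    return [height * math.exp(-0.5 * ((i - center) / width) ** 2)
            for i in range(n_chan)]

def residual_one_component(x_t, p):
    """Eq 1 and eq 2 for a one-column P: least-squares beta, then E = x_t - p*beta."""
    beta = sum(pi * xi for pi, xi in zip(p, x_t)) / sum(pi * pi for pi in p)
    residual = [xi - beta * pi for xi, pi in zip(x_t, p)]
    return beta, residual

n_chan = 200
p = gaussian_band(n_chan, center=50, width=8)                      # reference feature
new_band = gaussian_band(n_chan, center=150, width=8, height=0.3)  # feature absent from the reference set
x_t = [2.0 * pi + ni for pi, ni in zip(p, new_band)]               # simulated test spectrum

beta, E = residual_one_component(x_t, p)
peak_channel = max(range(n_chan), key=lambda i: E[i])  # channel where the residual is largest
```

Because the two bands do not overlap in this sketch, the residual E is essentially the new band itself; the overlapping case is where the choice of estimator starts to matter.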
Obviously, the difficulty is the estimation of the vector β. Two estimation methods are considered (OLS and ASLS). Ordinary Least Squares (OLS). When β is estimated with OLS, the residuals E may be interpreted in two ways. In the first way, the measured spectrum of the test set (vector xt) is split into a projection on the vector space spanned by the columns of the matrix P, and a part that is orthogonal to that space23 (see Figure 1). This orthogonal part is the E of eq 1. In the terminology introduced by Lorber,24 the vector E can now be regarded as the net signal that is fully specific for the test set sample and that is not affected by any change in the composition of known reference compounds. However, because of the incomplete signal model, part of the new spectral features may be captured by the reference model. This occurs whenever a band of the new compound in the test set spectra overlaps with bands of the reference spectra. In such a case, the component Pβ will not be zero, even when a test spectrum contains only new spectral features. Another way is to interpret E as the residual signal after a regression. Normally, in regression, a full model is fitted to the measured data. Use of a full and correct data model yields residuals that resemble spectral noise. Such a full model is, however, not available, and the partial model (eq 1) leads to an incorrect estimation of the reference contribution (Pβ). The residual signal will accordingly have a distinct pattern. Given the fact that the nonmodeled spectral contributions are all positive, it is expected that positive parts of the residual signal still indicate at which channels those new features are located. In addition, the approximate intensity of these new features could be extracted from the residual signal. Examples will be given in the Results Section. Asymmetric Least Squares (ASLS). There are two main differences between asymmetric least squares (ASLS)25,27 and the ordinary least squares (OLS). 
In OLS, all data points receive equal weights; in ASLS, positive and negative residuals are weighted differently. Additionally, ASLS is an iterative least squares procedure in which these data weights are updated each iteration. The procedure terminates when the data weights are stable. As such, ASLS is a member of the family of the iteratively reweighted least squares techniques.28,29 In this particular application of the ASLS, positive residuals are weighted less than the negative residuals because it is expected that E is a spectrum with essentially only positive intensities at each channel. Disregarding the spectral noise or possible artifacts for a moment, the parameter vector β should be adjusted in such a way that negative residuals do not occur. This adjustment is made by increasing the weight of the negative residuals. The Results Section discusses two examples that demonstrate the advantage of this approach.

Figure 1. Orthogonal vector (E) in the OLS method. Two vectors, p1 and p2 (columns of matrix P), span the space of the reference measurements. Estimating the coefficient vector β with OLS is identical to splitting the test set measurement (xt) in a vector in the reference space P and a vector orthogonal to the reference space (E).

(23) Draper, R. D.; Smith, H. Applied Regression Analysis; Wiley: New York, 1998.
(24) Lorber, A. Anal. Chem. 1986, 58, 1167-1172.
(25) Eilers, P. H. C. Kwantitatieve Methoden 1988, 23, 45.
(26) Groenen, P. J. F. Convexity of a function, personal communication.
(27) Eilers, P. H. C. Anal. Chem. 2004, 76, 404-411.
(28) Beaton, A. E.; Tukey, J. W. Technometrics 1974, 16, 147-185.
(29) Holland, P. W.; Welsch, R. E. Commun. Stat.-Theo. M. 1977, 9, 813-827.
(30) Malinowski, E. R. J. Chemom. 1999, 13, 69-81.

Each iteration of the ASLS algorithm minimizes the weighted sum of squares

$$L(\beta) = \sum_{i=1}^{n_{chan}} w_i \varepsilon_i^2 \qquad (3)$$

in which ε_i is the ith element of the residual vector E.
The weight wi of each channel i depends only on the sign of the corresponding residual. For positive residuals (ε_i > 0), this weight is set to φ, and for negative residuals, it is set to 1 - φ. The (fixed) parameter φ, the asymmetry parameter, takes values between 0 and 1. When φ is set equal to 0.5, all residuals have the same weight, which makes ASLS identical to OLS. Positive residuals are weighted less in this application of ASLS, and therefore, φ only takes values in the range 0 < φ < 0.5. Once the data weights are assigned, β can be updated by a weighted linear regression.23 Using the updated β, the residuals are again calculated, and data weights are reassigned. These iterations are repeated until the weights are stable. It can be shown that this algorithm is gradient-following28 and that its goal function (eq 3) is convex. Thus, convergence must follow26 (see Supporting Information). Experience has shown that convergence is obtained quickly, and at most 10 iterations are needed.

Selection of the φ Value. φ is a parameter of ASLS that should be set. A lower value of φ lets ASLS focus more on removing negative residuals. φ values that are too low will, however, incorrectly force negative spectral noise contributions to be positive. A higher value of φ, being closer to 0.5, will cause the ASLS solution to resemble the OLS solution. To select a suitable value of φ, the following method is proposed. The median of the amplitude distribution of the residuals (Meda) is compared to the median of the amplitude distribution of the original measured signal (Medm). Medians are used here instead of the means to get robust estimates of the centers of the amplitude distributions. When Meda is lower than Medm, φ has been set too high. On the other hand, when Meda becomes higher than Medm, negative spectral noise values are being forced to be positive. The value of φ for which Meda is equal to Medm is selected.
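The reweighting loop described above (assign weights from the residual signs, update β by weighted regression, repeat until the weights are stable) can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `solve_linear` and `asls_fit`, the iteration cap, and the Gaussian test bands (widths and heights) are hypothetical, while the band positions 50 and 60 and φ = 10⁻⁵ are borrowed from the overlapping synthetic example discussed in the Results Section.

```python
import math

def solve_linear(a, b):
    """Solve a small dense system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

def asls_fit(x_t, columns, phi, max_iter=200):
    """Asymmetric least squares fit of x_t on the columns of P (each of length nchan).

    Positive residuals get weight phi, all others 1 - phi; phi = 0.5 reduces to OLS.
    The iteration cap max_iter is an assumption, not a value from the text.
    """
    n_chan, n_p = len(x_t), len(columns)
    w = [0.5] * n_chan  # equal weights, so the first pass is an OLS fit
    for _ in range(max_iter):
        # Weighted normal equations: (P^T W P) beta = P^T W x_t
        a = [[sum(w[i] * columns[j][i] * columns[k][i] for i in range(n_chan))
              for k in range(n_p)] for j in range(n_p)]
        b = [sum(w[i] * columns[j][i] * x_t[i] for i in range(n_chan))
             for j in range(n_p)]
        beta = solve_linear(a, b)
        resid = [x_t[i] - sum(beta[j] * columns[j][i] for j in range(n_p))
                 for i in range(n_chan)]
        new_w = [phi if e > 0 else 1.0 - phi for e in resid]
        if new_w == w:  # weights stable: converged
            break
        w = new_w
    return beta, resid

def band(center, height, n_chan=200, width=8.0):
    """Hypothetical Gaussian band used only for this illustration."""
    return [height * math.exp(-0.5 * ((i - center) / width) ** 2) for i in range(n_chan)]

x_r = band(50, 1.0)                            # reference compound
x_n = band(60, 0.3)                            # overlapping unknown compound
x_t = [2.0 * r + n for r, n in zip(x_r, x_n)]  # noiseless test spectrum

beta_ols, e_ols = asls_fit(x_t, [x_r], phi=0.5)
beta_asls, e_asls = asls_fit(x_t, [x_r], phi=1e-5)
```

In this noiseless sketch the OLS coefficient absorbs part of the overlapping band and leaves clearly negative residuals, whereas the asymmetric weights pull the coefficient back toward the true contribution of the reference band. The Meda/Medm comparison could then be carried out by repeating the fit over a grid of φ values and comparing `statistics.median` of the residuals with that of the measured spectrum, which is one possible reading of the selection rule.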
The assumptions underlying the method are that either the number of baseline points exceeds the number of data points on the bands in the spectrum or that the spectral shapes of the measured spectrum and the residual signal do not differ too much.

Summary of the Method.
Step 1. Determine the dimension of the spectral subspace spanned by all reference spectra. A principal component analysis22 is used to find the loading vectors, pi, that span this subspace completely. The number of principal components (np) is determined with a Malinowski F-test.30 This method proved to work well for the type of spectroscopic data used here. Obviously, another appropriate method may be used to determine the full dimension of the reference space. This step yields the matrix P that is used in eq 1.
Step 2. The asymmetry parameter (φ) of the ASLS method is selected by comparing the median of the residual signal (Meda) with the median of the measured signal (Medm). The φ value at which these medians are equal is chosen.
Step 3. The test set spectrum is analyzed using the OLS and ASLS methods.

EXPERIMENTAL SECTION

Chemicals. Chloroform from Biosolve (Valkenswaard, The Netherlands) was used as a solvent (mobile phase) in all experiments. Solutions (10 mg/mL) of polycarbonate (PC) (molar mass, 36 000 g/mol; supplied by TNO Industry, Eindhoven, The Netherlands), PMMA (molar mass, 35 000 g/mol; Scientific Polymer Products, Ontario, NY), and PBMA (molar mass, 180 000 g/mol; Scientific Polymer Products) were prepared by weighing and dissolving in chloroform.

Chromatographic Setup. The SEC system (Waters 2690 Separations Module, Milford, MA) is equipped with a vacuum degasser and a thermostated column. Two different separation setups are used for the SEC-IR experiments in this study. In the first setup, A, two PLgel 5-µm mixed C columns (each 250 mm × 7.5 mm i.d., Polymer Labs, Church Stretton, UK) are used in series for separation at a flow rate of 0.4 mL/min. The total run time of the SEC-IR analysis is 61 min. In three runs, the PMMA (50 mg/mL) homopolymer, the PC (50 mg/mL) homopolymer, and a mixture of PMMA/PC polymers (each 50 mg/mL) are measured. In the second setup, B, one HSP-gel mixed l/m (150 mm × 6 mm i.d., Waters, Milford, MA) is used for separation at a flow rate of 0.4 mL/min.
The total run time is 12 min, and the sample volume of the polymer solutions of a concentration of ∼50 mg/mL is in all cases 50 µL. The temperature of the column is maintained at 30 °C. In four runs, the PMMA homopolymer (50 mg/mL), the PBMA (50 mg/mL) homopolymer, the PC homopolymer (50 mg/mL), and a mixture of PMMA/PBMA polymers (each 50 mg/mL) are measured.

Acquisition and Initial Processing of the Spectral Data. The FT-IR system consists of a Perkin-Elmer GX-MCT/B spectrometer (Norwalk, CT) and a high-pressure flow cell (Reflex Analytical, Ridgewood, NJ). Two CaF2 windows of 13-mm diameter (2-mm thickness) are used (transmission range 50 000-1100 cm⁻¹), which are separated by a circular 0.10-mm-thick Teflon spacer (8-mm clear aperture). Data acquisition is triggered by the LC system. Single-beam spectra (wavenumber range from 700 to 4000 cm⁻¹; resolution 1 cm⁻¹) are continuously recorded. Each single beam spectrum is an average of four scans. During a SEC run, ∼1 spectrum/min is collected. The spectral data acquisition is the same for both experimental setups, A and B. The single beam spectra (SB) are imported in MATLAB 6.5 (The MathWorks, Inc., Natick, MA, 2002), in which all further processing is done. “Pseudo” absorbance spectra (PA) for each measured spectrum (i) are calculated using the first measured single beam spectrum of the mobile phase (SB1), which does not contain any sample, as a background single beam spectrum.
$$\mathrm{PA}_i = \log_{10}\frac{\mathrm{SB}_i}{\mathrm{SB}_1} \quad (i > 1) \qquad (4)$$
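As a hedged illustration of eq 4, the sketch below converts a list of single beam spectra into pseudo absorbance spectra, taking the first spectrum as the background. The function name `pseudo_absorbance` and the toy intensities are assumptions. The ratio is oriented as the equation is printed here (SB_i over SB_1); conventional absorbance is usually defined with the opposite orientation, log10(SB_1/SB_i), so the sign convention should be checked against the original article before reuse.

```python
import math

def pseudo_absorbance(single_beams):
    """Apply eq 4 channel by channel: PA_i = log10(SB_i / SB_1) for every spectrum i > 1.

    single_beams[0] is taken as SB_1, the mobile-phase background.
    The orientation of the ratio follows the equation as printed (an assumption).
    """
    background = single_beams[0]
    return [[math.log10(sb / bg) for sb, bg in zip(spectrum, background)]
            for spectrum in single_beams[1:]]

# Toy single beam data (hypothetical): two channels, one background and one sample spectrum.
sb = [[100.0, 100.0],   # SB_1: mobile phase only
      [10.0, 100.0]]    # SB_2: the sample transmits 10% in the first channel
pa = pseudo_absorbance(sb)
```

The result contains one pseudo absorbance spectrum per measured spectrum after the first, on the same channel grid as the input.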
At chloroform (solvent) bands, no information about the polymers can be obtained. These band ranges, for example, 1200-1230 cm⁻¹, are not used and are indicated as gray areas in the figures. The spectra acquired with setup A do not require any further preprocessing. Some of the spectra collected with setup B have to be corrected for water bands using the following procedure. The water spectrum is estimated from the SEC-IR run itself (step 1). Subsequently, the intensity of a free water band (step 2) is used for making the correction (step 3).

1. Estimate the PA spectrum of water by selecting a single beam spectrum at the beginning (SBstart) and the end (SBend) of the SEC-IR run. The water content should be different at the start and the end of the SEC-IR run to be able to estimate the water spectrum, which is a reasonable assumption if the water background in the IR system itself changes during the run. Care should be taken also that only the mobile phase (solvent) is measured in both cases.

$$\mathrm{PA}_{water} = \log_{10}\frac{\mathrm{SB}_{start}}{\mathrm{SB}_{end}} \qquad (5)$$

2. In each IR spectrum, the free water band located at 3854 cm⁻¹ is used. The intensity at 3854 cm⁻¹ is corrected for background by subtraction of the spectral intensity at 3895 cm⁻¹. The background corrected intensity, A, is used to quantify the amount of water.

3. The corrected spectrum is calculated as

$$\mathrm{PA}_{i,c} = \mathrm{PA}_i - \mathrm{PA}_{water}\,\frac{A_i}{A_{water}} \qquad (6)$$
in which Awater is the background corrected intensity of the PAwater spectrum.

Simulations. To demonstrate the principle of the procedure, simple spectra containing two and three bands are simulated. In addition, a mixture of three homopolymers is simulated by adding the spectra of SEC-IR runs of those homopolymers (using setup B). The ratio of PC/PBMA/PMMA polymer in the reference set is 1.0/0.2/0.0, and the ratio of the same polymers in the test set is 0.3/0.4/0.6. All these simulations are performed in MATLAB.

RESULTS AND DISCUSSION

Comparison of OLS and ASLS for Synthetic Spectra. To study the differences between ordinary and asymmetric least squares, some simple, noiseless spectra are simulated and analyzed. For this experiment, the value of φ in the ASLS routine is set to 10⁻⁵. Figure 2 shows two different examples of a situation with two components in the test set. One compound (xr) is in the reference set (leftmost band), and one unknown compound (xn) is in the test set (band maximum indicated by vertical line).

Figure 2. Retrieval of the spectrum of an unknown compound for two different situations using OLS and ASLS (φ = 10⁻⁵). The rightmost band is from the unknown compound. Its band maximum (vertical line) is at channel 150 in the nonoverlapping situation (A and C) and at 60 in the overlapping situation (B and D). The band of the reference compound is always at 50. The true spectra of the compounds in both situations are shown in A and B. C and D show the “measured” test set spectrum (xt) (solid) and the OLS (dotted) and the ASLS residuals (dashed). Note: small offsets are added to the residual signals to show differences more clearly.

Two situations are discussed, a situation without (Figure 2A and C) and with (Figure 2B and D) band overlap of the two compounds. The overall synthetic profile (xt, solid line, Figure 2C and D) can only be modeled with the knowledge available from the reference set. Equation 1 is simplified to
$$x_t = x_r\beta + E \qquad (7)$$
in which β is a scalar constant. In the case of no band overlap (Figure 2C), the E’s obtained with OLS and ASLS (dotted and dashed lines, respectively) are identical. OLS is able to fully retrieve the band shape of the unknown compound because the band of the unknown compound does not increase the band intensity at the channels where the reference compound has intensity.

However, in the situation of band overlap, the residual signals obtained with OLS and ASLS are very different (Figure 2B and D). The ASLS method is still able to extract the correct band shape of the unknown compound, but the OLS utterly fails to do so. The reason is that the band of the unknown compound also contributes to the spectrum for channel numbers around 50. This causes the intensity of the test spectrum to be higher than expected when only the reference compound is present. In eq 7, the test spectrum is modeled only with the reference spectrum; therefore, the value of β will be estimated too high. As a result, too much of the reference signal is subtracted from the test spectrum (xt), and negative intensities are obtained in the residual signal of the OLS method. The most negative values are found at those channel numbers (left side of the reference band) where no overlap between the unknown compound and reference spectra exists.

The reason that the ASLS method correctly retrieves the band of the unknown compound from the test set is that the asymmetric weighting of the residuals puts more weight on the negative than on the positive residuals. In doing so, the channels having negative residuals become more important for estimating β. Channels having negative residual intensities are less disturbed by contributions of the band of the unknown compound and are thus better suited for the estimation of the contribution of the reference compound. This property of the ASLS procedure can be seen as a clever and automatic selection of the most relevant channels for the estimation of the reference contributions.

The residual signal of OLS is more difficult to interpret, but it still can be used to find the approximate band position of the unknown compound. The position of the maximum of the positive residuals is still an indicator for the band position of new spectral features in this case. Note, however, that this maximum is slightly shifted to the right of the true position in Figure 2D. The intensity of the maximum is also lower than the band maximum of the original unknown band. Increasing band overlap will lead to smaller residual signals of the OLS method that are more difficult to interpret, as the following example illustrates.

A more complex, simulated situation with severe band overlap of three compounds is shown in Figure 3. The spectra of the reference compounds (dashed lines) severely overlap the band of the unknown one (Figure 3A, solid line). The simulated test spectrum appears to have only one band (Figure 3B). The OLS residual signal (dotted line) is not adequate; two small positive local maxima are found, suggesting that two new unknown bands are present. Neither of these maxima coincides with the band maximum of the unknown compound. In addition, the intensity of the residual signal is small. In contrast, the ASLS residuals describe the unknown compound spectrum much better. Figure 3C illustrates what happens. For the first iteration step, the ASLS result is identical to the OLS solution. With increasing iteration number, the retrieval of the band of the unknown compound improves. Finally, the band of the unknown compound is extracted very well. The band position and intensity match the true profile nearly perfectly. The relative intensity difference between the true and estimated bands of the unknown compound is smaller than 1%. Figure 3D shows that the accuracy of the band intensities of the reference compounds depends on the value of φ. For this simulated example (noiseless data), φ values ≤ 10⁻⁵ will all provide a good estimation of these band intensities. Higher φ values lead to overestimation of the band intensity for one of the reference compounds and underestimation for the other.

Figure 3. A three-compound situation; two of them are reference compounds. A: Vertical line indicates the band maximum of the unknown band (solid line). Dashed lines are bands of the two reference compounds. B: “Measured” test set spectrum (solid); OLS (dotted) and ASLS residuals (φ = 10⁻⁵, dashed). C: Change of residual signal in ASLS as a function of the number of iterations. Convergence is achieved after 10 iterations. D: Accuracy of estimated band intensities of reference compounds for different φ values (compd 1: estimated, ●; true intensity, solid line; compd 2: estimated, ○; true intensity, dotted line). Note: the spectra are simulated and the test spectrum has been obtained by adding the three individual spectra.

SEC-IR of a Two-Polymer Mixture. The applicability of the procedure to complex real-life spectral sets is tested for the identification of new compounds in a SEC-IR chromatogram, as compared to a reference one. A PMMA standard, a PC standard, and a mixture of both are analyzed by SEC-IR (setup A).

Figure 4. Retrieval of PMMA spectra (C shows the spectra of the SEC-IR of the PMMA standard) from SEC-IR analysis of a mixture of PMMA and PC polymer (spectra shown in D). The spectra from an SEC-IR analysis of PC are used as a reference set. Residual signals are obtained with the OLS method (A) and the ASLS (φ = 10⁻³) method (B). Vertical lines show band positions of PMMA. Grey vertical areas in this and subsequent figures indicate wavelength ranges with strong solvent (chloroform) absorption bands. The PC polymer spectrum of a standard at maximum elution (broad solid line) is shown in D.

Figure 5. A: Plot of the median of the reconstructed data (Meda) against the φ value of the ASLS method. The horizontal line (dotted) corresponds to the median of spectra measured during a run (Medm). B: Plot of the average correlation between reconstructed and true PMMA spectra against the φ value of the ASLS method.

Figure 6. Elution profiles of PMMA reconstructed from a SEC-IR analysis of a PMMA/PC mixture obtained by different methods: the (assumed) “true” profile determined from summed spectral intensities between 1720 and 1725 cm⁻¹ (solid line). Profiles derived from ASLS residuals signal (dotted line) and from OLS residuals (dashed line) are shown. For the latter two lines, the (corrected) summed intensity between 1150 and 1155 cm⁻¹ is used, at which also PC showed absorption. The vertical line is at the maximum of the “true” profile. For more details, see text.
Figure 7. Impact of the different values of φ on the residual signal when retrieving PMMA from a PMMA/PC mixture using the ASLS method. A: φ = 10⁻³; B: φ = 10⁻⁵; C: φ = 10⁻¹. Arrows indicate differences in B and C spectral sets with respect to spectral set A. Same data sets were used as in Figure 4.
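The role of φ illustrated in Figure 7 can be made concrete with a minimal sketch of an asymmetric least-squares fit, given here in Python with NumPy. The sketch assumes that positive residuals receive weight φ and negative residuals weight 1 − φ (consistent with φ = 0.5 reproducing the OLS solution), that the reference space is an uncentered SVD basis whose dimension is supplied by the user, and that the fit is solved by iteratively reweighted least squares. The function names `reference_basis` and `asls_residual` are hypothetical, and the code is not necessarily the implementation used for the results reported here.

```python
import numpy as np


def reference_basis(ref_spectra, n_components):
    """Orthonormal basis (n_channels x n_components) spanning the reference spectra.

    ref_spectra holds one spectrum per row. n_components is supplied by the
    caller; the Malinowski F-test mentioned in the text is not implemented here.
    """
    _, _, vt = np.linalg.svd(ref_spectra, full_matrices=False)
    return vt[:n_components].T


def asls_residual(spectrum, basis, phi, max_iter=100):
    """Fit one spectrum with the reference basis by asymmetric least squares.

    Assumed weighting: phi for positive residuals, 1 - phi for negative ones.
    Returns the residual signal (spectrum minus fitted reference part).
    """
    weights = np.full(spectrum.shape, 0.5)  # equal weights: first pass is an OLS fit
    for _ in range(max_iter):
        sw = np.sqrt(weights)
        # Weighted least squares via row scaling of the basis and the spectrum.
        coef, *_ = np.linalg.lstsq(basis * sw[:, None], spectrum * sw, rcond=None)
        residual = spectrum - basis @ coef
        new_weights = np.where(residual > 0, phi, 1.0 - phi)
        if np.array_equal(new_weights, weights):
            break  # the sign pattern of the residual no longer changes
        weights = new_weights
    return residual
```

With `phi = 0.5` every channel carries the same weight, so the first pass already returns the OLS residual. Small values such as `phi = 1e-3` penalize negative residuals far more heavily, so new bands tend to remain in the residual as positive features instead of being partly absorbed by the reference model.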
Figure 4 shows the spectrum taken at the peak maximum of the chromatogram of the SEC-IR PC standard (broad solid line, D) and the spectra of the SEC-IR analysis of the PMMA standard (C). The strong bands of PMMA and PC located at 1728 and 1770 cm⁻¹, respectively, are only slightly overlapping, whereas in the range of 1140-1300 cm⁻¹, severe band overlap occurs. Comparing Figure 4D with 4C also shows that some PMMA bands (vertical lines) in the 1140-1300 cm⁻¹ range are not visible at all in the spectra of the PMMA/PC mixture (Figure 4D). The goal is to reveal the complete PMMA spectra from the spectra measured during the SEC-IR run of the polymer mixture. The spectra measured during the SEC-IR analysis of PC are used to span the reference space. The Malinowski F-test finds a three-dimensional reference space. Figure 5A shows a plot of the φ value against the median of the reconstructed PMMA spectra (Meda). For φ = 10⁻³, the value of Meda is approximately equal to the value of Medm. According to step 2 of our method, this φ value should be used in the ASLS method. Figure 4 shows that the PMMA spectra reconstructed with this φ value in the ASLS method (Figure 4B) and the spectra reconstructed with the OLS method (Figure 4A) differ significantly. Using ASLS, the residual pattern matches the IR spectra obtained for the SEC-IR run of pure PMMA very well with regard to the band intensities and band positions (compare Figure 4B with 4C). Even the bands of PMMA at 1194 and 1274 cm⁻¹ that are not visible in the spectra of the polymer mixture (Figure 4D) are recovered well by the ASLS method. With the OLS method, some positive local maxima in the residual signal match rather well
(solid vertical lines in Figure 4A) with known band positions of PMMA, but some bands are not revealed (dotted vertical lines in the same figure). In addition, a spurious band at 1171 cm⁻¹ is found. Moreover, two bands at 1194 and 1244 cm⁻¹ are not revealed by OLS because strong bands due to the PC polymer are present at nearly the same positions (Figure 4A). Another low-intensity band at 1388 cm⁻¹ is missed because baseline disturbances blur this band. Finally, the intensity of the OLS residual signal is lower than that of the ASLS residual signal. In summary, the detection of the new spectral features is difficult with the OLS method, in contrast to the ASLS method. It should be mentioned that the S/N ratio of these spectra is ∼250 for the main absorption bands (σnoise = 2 × 10⁻⁴ A.U.) and ∼20 for the minor absorption bands that should be retrieved. To investigate the quantitative recovery of new features in difficult situations, the PMMA elution profiles obtained from the SEC-IR analysis of the PC/PMMA mixture with the ASLS and the OLS methods are compared with the actual (“true”) PMMA elution profile in that sample. The PMMA elution profile is derived from the residual signals of the ASLS and the OLS methods as a reconstructed functional group chromatogram of the 1152 cm⁻¹ absorption band by summation of the intensities of the residual signal in the range 1150-1155 cm⁻¹ (dotted and dashed lines in Figure 6, respectively). The challenge is to reveal the correct elution profile, despite the disturbance by the PC spectrum in this wavenumber region. For reference, the actual elution profile of PMMA in the SEC-IR analysis of the PC/PMMA mixture is reconstructed for the 1722 cm⁻¹ absorption band by summation of the spectral intensities
Figure 8. Retrieval of PMMA spectra from SEC-IR analysis of a mixture of PMMA and PBMA polymers using the spectra of a PBMA SEC-IR analysis as the reference data set. Residual signals are obtained using the OLS method (A) and the ASLS (φ = 5 × 10⁻⁵) method (B). The SEC-IR run of the PMMA polymer (C) and the spectra of a PMMA/PBMA mixture (D) are shown. Additionally, the spectrum of a PBMA standard at elution maximum (broad solid line) is shown (D).
between 1720 and 1725 cm⁻¹ (Figure 6, solid line); this wavenumber range is not disturbed by the PC spectrum. Next, the actual (assumed to be “true”) functional group chromatogram of the 1152 cm⁻¹ PMMA peak is calculated from the (undisturbed) 1722 cm⁻¹ functional group chromatogram and corrected by the peak ratio of the bands at 1152 and 1722 cm⁻¹ in the SEC-IR analysis of a standard PMMA sample. The 1152 cm⁻¹ PMMA peak profile thus obtained for the SEC-IR of the PC/PMMA sample closely matched that of the PMMA standard. The PMMA profile obtained by ASLS is only slightly lower (2%) than the actual PMMA profile, whereas the PMMA profile obtained by OLS is significantly less intense (16%) (Figure 6). The asymmetry factor, φ, of the ASLS method affects the results of the spectral reconstruction. From Figure 5A, it can be seen that for a value of φ = 10⁻³, the median of the measured mixture spectra (Medm) and the median of the reconstructed PMMA spectra (Meda) are the same. Figure 5B shows the average correlation of all the reconstructed PMMA spectra with the true PMMA spectra. This average correlation also has an optimum near φ = 10⁻³, indicating that on average, the best reconstruction of the PMMA spectra is possible with such a φ value. Figure 7 shows the shape changes in the reconstructed PMMA spectra when the φ value is varied. Increasing φ reduces the difference in weight between the positive and negative residuals. When φ approaches 0.5, the ASLS solution resembles an OLS solution (Figure 7C). The negative spectral features (indicated by the arrows) emerge at the same positions
as in Figure 4A (OLS solution). Making φ too low leads to small positive bands at the same locations, and more spectral noise seems to be present (Figure 7B). The reason is that for decreasing φ, the positive residuals receive lower weight. In this way, the impact of positive errors caused by noise and estimation errors persists. SEC-IR of More Complicated Polymer Mixtures. Two more challenging examples were studied. In the first example, the spectral features of PMMA were extracted from the IR spectra of a SEC-IR run of a mixture of PBMA and PMMA polymers. In the second example, artificial mixture spectra that mimic a SEC-IR run of mixtures of two and three polymers were generated. For this, SEC-IR spectra of the PMMA, PBMA, and PC homopolymers acquired with setup B were used. The aim again was to retrieve the PMMA spectral features. The first example may be considered difficult because the absorbance spectra of the PMMA and PBMA homopolymers are very similar (see Figure 8C and D). The band overlap of the spectra is severe, making the retrieval of specific PMMA features difficult. For example, around 1730 cm⁻¹, the bands of PMMA and PBMA are only slightly shifted with respect to each other (Figure 8D), and in the 2800-3000 cm⁻¹ range, the polymer spectra are nearly completely overlapping. The OLS method cannot reveal the band at 2951 cm⁻¹ and has a problem revealing the exact position of the PMMA band at 1728 cm⁻¹ (Figure 8A). In addition, the maximum intensity of the OLS residual signal at ∼1730 cm⁻¹ is lower than for the ASLS signal (compare Figure
Figure 9. Retrieval of PMMA spectra from a simulated SEC-IR data set of a PC/PMMA/PBMA mixture using the SEC-IR spectra of a PBMA/PC mixture with another composition as the reference set. The simulated SEC-IR data are obtained by addition of the spectra of the SEC-IR runs of the individual homopolymers. Residual signals are obtained with the OLS method (A) and the ASLS (φ = 1.3 × 10⁻³) method (B). Scaled spectra of the SEC-IR analysis of the PMMA polymer only (C), the spectra of mixtures (solid lines), and the spectra of PC (broad dashed solid line) and PBMA (fat solid line) are shown (D).
8A and B). The ASLS method is able to recover the band position of the three important PMMA bands (1728, 2951, 2996 cm⁻¹) well; however, the intensity of the recovered band at 1728 cm⁻¹ is lower (∼22%) than the expected intensity (Figure 8C). Additionally, some small spurious bandlike features can be seen around 1800 cm⁻¹. It can be concluded that the ASLS method gives a fairly accurate picture of the spectral features of PMMA, but the OLS method misses the larger part of the PMMA spectral features. By adding the spectra obtained by the SEC-IR analysis of the individual homopolymers (PC, PMMA, and PBMA), two simulated SEC-IR analyses of polymer mixtures are generated. The reference set consists of spectra from a simulated analysis of a PC and a PBMA mixture. The test set consists of spectra of a simulated SEC-IR analysis of a PC/PMMA/PBMA mixture with a different PC/PBMA ratio. The difference in the PC and PBMA composition between the reference measurements and the test measurement makes the extraction of the features of PMMA in principle even more difficult than in the previous case. Figure 9 shows that the ASLS method is able to retrieve the new spectral features due to PMMA in the test set measurement rather well, in contrast to the OLS method (Figure 9A). All major band positions of PMMA are retrieved; only the overall intensity of the reconstructed signal is lower (20%), but relative intensities of bands in the signal are nearly the same as in the PMMA spectra (Figure 9C).
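The quantitative comparison of Figure 6 rests on functional group chromatograms, that is, intensities summed over a narrow wavenumber window for every spectrum of a run, and on a peak-ratio correction taken from a PMMA standard. A minimal sketch of both steps follows. It assumes that each run is stored as a matrix with one spectrum per row and that the peak ratio is the ratio of the summed window intensities over the whole standard run; the text does not specify how the ratio was computed. The function names are hypothetical.

```python
import numpy as np


def functional_group_chromatogram(spectra, wavenumbers, low, high):
    """Sum the intensities between low and high (cm^-1) for every spectrum (row)."""
    window = (wavenumbers >= low) & (wavenumbers <= high)
    return spectra[:, window].sum(axis=1)


def expected_band_profile(mixture, standard, wavenumbers, clean_band, target_band):
    """Predict the elution profile of an overlapped band from an undisturbed band.

    clean_band and target_band are (low, high) wavenumber windows. The profile of
    the clean band in the mixture is scaled by the target/clean peak ratio
    measured on the standard (assumed definition of the peak ratio).
    """
    clean_mix = functional_group_chromatogram(mixture, wavenumbers, *clean_band)
    clean_std = functional_group_chromatogram(standard, wavenumbers, *clean_band)
    target_std = functional_group_chromatogram(standard, wavenumbers, *target_band)
    ratio = target_std.sum() / clean_std.sum()
    return ratio * clean_mix
```

For the PMMA/PC example, `clean_band` would be `(1720, 1725)` and `target_band` would be `(1150, 1155)`. Applying `functional_group_chromatogram` to the ASLS or OLS residual signals over the target window then gives the profiles that are compared with this expected profile.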
CONCLUSIONS
The general challenge of comparing complex spectral data sets has been addressed: a simple and effective method based on asymmetric least squares (ASLS) is proposed to retrieve new spectral features in spectra, as compared to a reference set. A significant improvement could be achieved by using the fact (and the constraint) that these new spectral features are positive. The proposed ASLS-based method has been shown to perform much better than a conventional method in the discovery of new features in data sets obtained by hyphenated chromatography coupled to IR detection (SEC-IR) of mixtures of polymers. Band positions, band shapes, and band intensities of new spectral features are much better retrieved with the ASLS than with the OLS method and are very close to the true values. In fact, the ASLS method is able to discover new features in spectral regions with severe band overlap that cannot be distinguished by eye. Another improvement of ASLS with respect to OLS is that the residual signal of ASLS can be interpreted with the same strategy that is generally used in spectroscopy. For example, the residual signal in the SEC-IR data retrieved by ASLS can be interpreted as an ordinary IR spectrum. Another advantage of the method is that shifts in the elution time do not affect the procedure at all. At the moment, the potential of the described approach for more complex data sets, that is, SEC-IR analyses of aging polymers, is being investigated. In addition, applications are envisioned in the field of biomarker discovery, for example, for diagnosis. The ASLS method may be used as a tool in a more elaborate procedure that tries to find and to identify biomarkers.
ACKNOWLEDGMENT
We thank Erwin Kaal, Dr. Sander Kok, and Dr. Leon Coulier for the acquisition of the SEC-IR data and for stimulating discussions. Dr. Renger Jellema and Dr. Sabina Bijlsma are acknowledged for stimulating discussions on the processing of complex SEC-IR data sets.
Received for review August 2, 2005. Accepted September 26, 2005.
AC051370E