Extraction of Pure Species Spectra from Labeled Mixture Spectral

Jun 17, 2019 - Extraction of pure species spectra from mixture spectra is important in characterizing unknown mixtures as well as in the monitoring of...
0 downloads 0 Views 2MB Size
Article pubs.acs.org/IECR

Cite This: Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Extraction of Pure Species Spectra from Labeled Mixture Spectral Data Abhishek Baikadi,† Nirav Bhatt,*,‡,§ and Shankar Narasimhan*,†,§ †

Department of Chemical Engineering, ‡Department of Biotechnology, and §Robert Bosch Centre for Data Science and Artificial Intelligence, Indian Institute of Technology Madras, Chennai, Tamil Nadu 600036, India

Downloaded via NOTTINGHAM TRENT UNIV on August 13, 2019 at 08:11:21 (UTC). See https://pubs.acs.org/sharingguidelines for options on how to legitimately share published articles.

S Supporting Information *

ABSTRACT: Extraction of pure species spectra from mixture spectra is important in characterizing unknown mixtures as well as in the monitoring of chemical reactions. In many designed experimental studies, the mixture concentrations are completely known and partial knowledge of the spectra of some of species in a mixture may also be available. In this study, we extend the methods of ordinary least squares (OLS), principal component regression (PCR), and non-negative matrix factorization (NMF) to incorporate such additional information. The performances of the three proposed methods are evaluated using simulated and experimental data. Among these, the proposed constrained NMF (cNMF) method is shown to be best-suited for obtaining feasible and accurate estimates of pure species spectra from mixture spectra.



INTRODUCTION Spectroscopy is one of the routinely used techniques for inferring concentrations of various species in a mixture. Further, spectroscopy allows us to measure concentrations online at high sampling rates. Hence, spectrometers are used for online monitoring of (bio)chemical processes in industries. Multivariate calibration (MVC) and self-modeling curve resolution (SMCR) are two common applications to analyze mixture spectra.1 The purpose of MVC is to develop a regression (or calibration) model based on spectral data and the corresponding concentrations.1 The calibration model can then be used to predict the concentrations of mixtures from their spectra. This is especially useful for online monitoring, control, and optimization of chemical and biochemical reactions. Methods such as ordinary least-squares (OLS), principal component regression (PCR), and partial least-squares (PLS) are typically used to build the regression model.2 For developing such a model, mixtures containing different selected concentrations of the species are prepared (usually through an experimental design) and their spectra are measured to obtain a calibration set. In certain cases, a calibration model may not be available and it may be required to determine both the concentrations and spectra of pure species from the spectra of mixtures containing these species. This problem is referred to as SMCR in the chemometrics literature.3,4 This is similar to the blind source separation (BSS) problem addressed in signal processing literature.5,6 In these problems, the mixture spectra are assumed to be a weighted combination of pure species © 2019 American Chemical Society

spectra, where the weights are the concentrations of the species in the mixture. Multivariate curve resolution-alternating least-squares (MCR-ALS) is one of the widely used methods for SMCR.4,7−9 Without any additional information, a unique solution to the problem cannot be guaranteed. Instead of obtaining the pure species spectra, a linear combination of the spectra may be obtained. This is known as rotational ambiguity. Even if pure species spectra are extracted, only relative (or scaled) concentrations of the species in the mixture are obtained. This is known as scaling ambiguity. Constraints such as non-negativity of pure species spectra and their concentrations, unimodality of certain types of concentration profiles, and mass balance closure conditions are applied to resolve the rotational ambiguity. However, these constraints rarely guarantee a unique solution, and a set of feasible solutions is also possible.3 Hence, finding and applying appropriate constraints that lead to a unique solution are key elements in enhancing the MCR-based techniques.3 Alternatively, SMCR methods based on the concepts of information entropy and statistical methods have been proposed in the literature.10−14 Chew et al.11 proposed an approach called band-target entropy minimization to recover pure species spectra from mixture spectra. The main Special Issue: Dominique Bonvin Festschrift Received: Revised: Accepted: Published: 13437

January 23, 2019 June 10, 2019 June 17, 2019 June 17, 2019 DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Industrial & Engineering Chemistry Research

Article



EXTRACTING PURE SPECIES SPECTRA FROM MIXTURE SPECTRA Spectroscopy allows measuring concentrations of species in a mixture indirectly in an inline/online manner. Consider a mixture containing s species. Let z(k) be the spectral vector for the kth mixture scanned at n wavelengths. The following assumptions are made regarding the mixture of s species and spectral vector z(k): (A1) All the species are absorbing and nonreacting, (A2) Beer’s law is valid, i.e., the spectra depend linearly on the concentrations of species, (A3) the pure species spectra (PSS) are linearly independent. Then, using assumptions A1−A3, the spectral vector, z(k), can be written in terms of an (s × n)-dimensional PSS matrix (E) and the sdimensional column vector containing concentrations c(k) as follows:

advantages of this method are (i) reduction of computation time due to handling one spectrum at time, and (ii) capability of estimating components having low spectral signal in measured data.12 SMCR methods based on blind source separation techniques such as independent component analysis (ICA) and non-negative matrix factorization (NMF) have also been developed in the literature.15,16 In some applications, some of the pure species spectra in the mixture may be available, and it is desired to extract the spectra of remaining species in the mixture. We present three possible use cases. • A standard problem is to estimate the concentration of glucose in blood, fruits, or biochemical reaction using NIR. It is not possible to obtain a pure spectrum of glucose and it has to be inferred from the NIR spectra of aqueous solutions of glucose.17

z(k) = c(k)T E

• In recent years, analysis of the NMR spectra of urine sample is used for detecting certain diseases through the presence of certain metabolities not present normally in urine samples of healthy subjects.18 In general, the metabolites or their spectral signatures are not known a priori and have to be extracted by deconvolution of NMR spectra and using the spectra of those metabolities known to be present in normal samples.

(1)

For m mixtures, eq 1 can be written as follows: Z = CE

(2)

where Z is an (m × n)-dimensional spectral data, C is an (m × s)-dimensional concentration matrix with each column consisting of concentrations of species. In practice, the spectral data is corrupted with the noise. The noisy spectral data Zm can then be written as follows:

• In online reaction monitoring using spectrophotometers, we may have to deal with short-lived (unstable) intermediates whose spectra cannot be obtained through direct measurement. For monitoring such reactions, the spectra of unstable intermediates can be obtained only through deconvolution of reaction mixture spectra.19

Zm = Z + F = CE + F

(3)

where F is an (m × n)-dimensional error matrix. It is assumed that the errors are independently and identically distributed Gaussian random variables with zero mean and covariance matrix σ2I. The objective of this section is to develop approaches for extracting pure species spectrum or spectra (E) using the labeled data, the measured spectral data Zm and concentration data C and/or a subset of noisy but known pure species spectra. We classify these approaches for extracting PSS into three types: (i) Use of multivariate calibration techniques, (ii) use of blind source separation technique, and (iii) use of hybrid techniques.

In addition, concentrations of some or all of the species in a mixture may be available through other analytical techniques. Such additional information in SMCR methods can be used to enhance not only the accuracy of pure species spectra extraction but also to guarantee unique solutions. The objective of this work is to develop approaches for incorporating such additional information in existing methods to extract pure species spectra from spectra of mixtures containing these species. We show how MVC methods, OLS and PCR, can be used to extract pure component spectra. We also propose a modified NMF method to utilize concentration data to obtain unique and feasible (non-negative) estimates of pure component spectra. The modified NMF is labeled as constrained NMF. All these methods are also extended to incorporate spectra of some of the pure species, if available. The different methods differ with respect to the assumptions they make to deal with noise in measured concentrations and measured spectra. The proposed approaches are especially useful in building a database of reference spectra of pure species. Such databases are subsequently useful in determining concentrations of unknown mixtures. The article is organized as follows. The second section presents approaches to recover pure component spectra using OLS and PCR. In the next section, constrained NMF is developed to extract pure component spectra incorporating information on concentrations and known pure component spectra of a subset of the species in the mixture. The following section then describes the simulated and experimental data set used to demonstrate various approaches developed in this work. The results are discussed and appropriate conclusions are drawn at the end of the article.



USE OF MULTIVARIATE CALIBRATION TECHNIQUES Two MVC techniques, ordinary least-squares (OLS) and principal component regression (PCR), will be used to extract PSS in this section. In MVC, the concentrations and corresponding spectral data of mixtures are assumed to be available. Extracting Pure Species Spectra Using OLS. OLS has generally been used to identify a regression model for predicting concentrations from the spectral data. However, because both mixture spectra and concentrations are available, eq 3 can be used to directly obtain a least-squares estimate of the PSS. The ordinary least-squares estimate is given by Ê = (CT C)−1CT Zm

(4)

In deriving the above estimates for PSS, it is assumed that (i) the concentration measurements are noise free, and (ii) the errors in mixture spectra for all mixtures are independent and have identical variances. If the PSS of a subset of species are known, then the unknown PSS can be estimated using OLS, using a simple modification. Let Ek be the (sk × n)-dimensional matrix of known spectra, and Eu be the (su × n)-dimensional matrix of 13438

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Article

Industrial & Engineering Chemistry Research Ê = βs−1 V1T

unknown spectra. Note that s = sk + su. The measured spectral data Zm can then be written as follows: Zm = Ck Ek + CuEu + F

If spectra of some of the pure species are provided, then this information can also be utilized in the above procedure as follows. As before the known spectra matrix of the pure species is denoted as Ek. However, unlike OLS, the known pure species spectra are not assumed to be free of errors, but are assumed to be corrupted by errors. The variances of errors in both the mixture spectra and known pure species spectra are assumed to be same. The mixture spectral data matrix is augmented with the noisy known pure species spectral data to obtain the augmented spectral data matrix given by ÄÅ ÉÑ ÅÅ Z ÑÑ Z̃ = ÅÅÅÅ ÑÑÑÑ ÅÅÇ Ek ÑÑÖ (15)

(5)

where Ck and Cu are the (m × sk)- and (m × su)-dimensional matrices of the known concentrations corresponding to the known pure species spectra and the unknown species spectra, respectively. Assuming that the spectra of the known species Ek is noise free, the contribution of the unknown part (Z̅ m) can be computed as follows: Z̅ m = Zm − Ck Ek = CuEu + F

(6)

The spectra of the unknown species can now be estimated using OLS as follows: Ê u = (CTu Cu)−1CTu Z̅ m

(7)

Extracting Pure Species Spectra Using PCR. Unlike OLS, principal component regression (PCR) accounts for errors in both the mixture spectra and concentration measurements to develop a MVC model using a two-step procedure. We propose a modification of this two-step method to extract pure species spectra as follows. 1. In the first step, PCA is applied to the mixture spectra data matrix Z to obtain an (m × s) lower dimensional matrix T of full column rank using singular value decomposition as follows. Z = U1S1V1T + U2S2V2T

As before, in the first step PCA is applied to the augmented data matrix (Z̃ ) and the scores matrix, T̃ corresponding to the first s largest singular values is obtained from the singular value decomposition of the augmented data matrix. The scores matrix (T) corresponding to the m mixture spectra is obtained by extracting the first m rows of T̃ . A linear regression model relating the concentration to the scores followed by an estimate of the pure species spectra are obtained as before using eqs 12 and 14. It may be noted that the scores matrix (whose columns define the s-dimensional pure species spectral subspace) is obtained using both the noisy mixture and known pure species spectra in the first step of this modified PCR. Furthermore, an updated estimate of the known pure species spectra is also obtained in this process in the second step.

(8)

where U1: n × s and V1: n × s are orthonormal matrices, whose columns are, respectively, the left and right singular vectors corresponding to the first s largest singular values, and S1:s × s is a diagonal matrix containing the corresponding singular values. The first term in the right-hand side of eq 8 can also be regarded as the estimated denoised spectra of all mixtures, while the second term can be attributed to the estimated noise. The matrix T (also known as the scores matrix) is obtained as T = U1S1 = ZV1



EXTRACTING PURE SPECIES SPECTRA USING NMF NMF decomposes a given non-negative (m × n)-dimensional matrix (here the term non-negative matrix implies that all the elements of the matrix are non-negative) Z into two lower rank non-negative matrices W(m × s) and H(s × n). This decomposition can be performed by minimizing the cost function F(Z, WH) as follows:

(9)

and the denoised mixture spectra are obtained as Ẑ = U1S1V1T = TV1T

(10)

1 Z − WH 2

2. In the second step, a multivariate linear regression model between the concentrations and the scores is obtained by assuming the scores to be noise free and the concentrations measurements to be noisy. The linear regression model can be written as

C = Tβs + Fc

(11)

(12)

It may be noted that βs is a square matrix of full rank and is therefore invertible. From eq 10 and eq 12 the linear relation between the denoised mixture spectra and concentrations can be obtained as Ẑ = TV1T = Cβs−1 V1T

2

=

1 2

∑ [Z − WH]ij2 ij

(16)

Equation 1 shows that the noise-free mixture spectral matrix can be factored as a product of two lower rank matrices, NMF can be used to simultaneously extract both the concentration matrix and pure species spectral matrix by interpreting the matrix W to be the concentration of the mixtures and the rows of the matrix H to correspond to the pure species spectra. Both the true concentrations and spectral intensities are nonnegative and NMF ensures that the corresponding estimates are also non-negative, unlike in OLS or PCR. Classical NMF does not require concentration measurements and extracts pure species spectra using only mixture spectra. For this purpose, the number of mixtures should be greater than or equal to the number of pure species in the mixture. The solution of the NMF problem can be obtained using alternating least-squares (ALS). The algorithm is given as follows: Step 1 Initiate W, H; s.t. Wi,j, Hj,k ≥ 0 ∀ i,j,k. Step 2 Given H; solve the following problem

where βs is the regression matrix and Fc is an (m × s)dimensional error matrix. The least-squares solution of βs is given by βs = (TTT)−1TTC

(14)

(13)

From the above relation, the estimated pure species spectra are therefore derived as 13439

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Article

Industrial & Engineering Chemistry Research 1 ||Z − WH||2 2

(17a)

subject to wij ≥ 0, ∀ i , j

(17b)

minimize W

minimize W, H

subject to wij ≥ 0, ∀ i , j , hjk ≥ 0, ∀ j , k ,

Step 3 Given W; solve the following problem 1 ||Z − WH||2 2

(18a)

subject to hjk ≥ 0, ∀ j , k

(18b)

minimize H

W = Wk + 1

(19a)

H = Hk + 1

(19b)

(21)

where α ≥ 0 is a weighting parameter, which denotes the accuracy of concentration measurements relative to that of the spectral measurements. In addition, if PSS of a subset of species are available, then Z can be augmented as shown in eq 15. Then, the objective function in eq 21 can be rewritten in terms of the augmented matrices Z̃ and C̃ as follows:

Step 4 If ||Wk − Wk+1 || ≥ ϵW or ||Hk − Hk+1|| ≥ ϵH, go to Step 1. Otherwise terminate this algorithm.

minimize W, H

1 ̃ ̃ ||2 + α 1 ||C̃ − W̃ ||2 ||Z − WH 2 2

subject to wij ≥ 0, ∀ i , j , hjk ≥ 0, ∀ j , k ,

ÄÅ ÉÑ Å ÑÑ where W̃ = ÅÅÅ W Ñ. It can be seen that the last sk rows are ÅÇ I ̃ ÑÑÖ perfectly known, and hence, they should not be updated. The optimization problems in eqs 21 and 22 can be solved using alternating least-squares (ALS) framework for NMF. Two subproblems can be formulated as follows:

Several algorithms for improving the convergence speed have been proposed in the literature to solve the NMF problem.20,21 However, the NMF solution suffers from the following ambiguities:22 • Scale and rotational ambiguity: Consider a nonsingular matrix M ∈ +s × s . Then, the factorization of Z can also represented as WH = (WM)(M−1H) = WM HM

1 1 ||Z − WH||2 + α ||C − W||2 2 2

(22)

• Given H:

(20)

It is possible to find several solutions for M such that WM and HM are non-negative. Then, WM and HM is also a solution of the problem (16). Only under very restrictive conditions can NMF guarantee a unique solution.23 • Ordering ambiguity: Note that Z ≈ ∑iw.,ihi,. = ∑iw.,p(i)hp(i),. with p(·) is an ordering (or permutation). Hence, it is not possible to identify which column of W corresponds to which pure species spectra, without additional information.

1 1 ||Z − WH||2 + α ||C − W||2 W 2 2 subject to wij ≥ 0, ∀ i , j. minimize

(23)

• Given W: 1 ||Z − WH||2 H 2 subject to hij ≥ 0, ∀ i , j. minimize

Because NMF has many solutions, additional constraints such as sparsity, partial or full knowledge of noisy W or H need to be imposed to obtain a unique solution of the problem (16). Paatero24 has developed a method which allows incorporation of different types of linear constraints in NMF. The solution of this constrained NMF problem is obtained using the conjugate gradient method. Although this can take into account different types of constraints, it may not be an efficient approach to include additional information such as concentration and/or pure spectra. In the following subsection, a variant of NMF denoted as constrained NMF (cNMF) is proposed to incorporate information on concentration data and information on the known pure component spectral data. The proposed cNMF algorithm retains the advantage of the alternating least-squares approach and solves a quadratic program at each step, and is a computationally more efficient strategy as compared to using a general purpose constrained optimization method. Constrained NMF. To incorporate available concentration information, we propose a constrained NMF (cNMF) formulation by adding a penalty term to modify the objective function of standard NMF. The modified constrained NMF formulation is given by

(24)

The objective functions in eq 23 and 24 are quadratic with linear inequality constraints. However, to use a standard quadratic programming solver, the gradient and the Hessian of the objective function with respect to the decision variables have to be provided. As shown in the Supporting Information, the concept of Kronecker products can be used to cast the two subproblems as standard quadratic programming (QP) problems as follows:25 • Given H: 1 vec(W)T [2(HHT ⊗ In) + 2α(Is ⊗ In)] 2 vec(W) − vec(ZHT + αC)T vec(W)

min W

st − Insvec(W) ≤ 0ns (25)

• Given W: 13440

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Article

Industrial & Engineering Chemistry Research

the range of 300 and 650 nm at intervals of 2 nm. This leads to a (3 × 176)-dimensional matrix of pure species spectra matrix.

min vec(H)T (Im ⊗ WTW)vec(H) − 2vec(WTZ)T H

vec(H) st − Imsvec(H) ≤ 0ms (26)

With non-negativity constraints on the elements of W and H, the matrices (HHT ⊗ In) + 2 α (Is ⊗ In) and Im ⊗ WTW in eqs 25 and 26 are positive semidefinite matrices. Hence, the objective functions in eqs 25 and 26 are standard quadratic programming problems, and the optimal solutions of these subproblems exist. Hence, standard QP solvers in commercial scientific packages can be applied to obtain solutions of the problems. It can be proved21,26 that the ALS algorithm converges to a solution, although the solution may be a local optimum and not the global optimum. In contrast to the standard NMF algorithms based on the update rules, note that the above solution strategy using the QP can handle the presence of negative elements in the measured spectral data. Although the main objective of this paper is to estimate unknown pure species spectra, cNMF can also be used to obtain relative estimates of the concentrations of all species in all the mixtures. (from the estimated W matrix). If one of the species concentrations is available in every mixture, then the absolute concentrations of all species can also be obtained. In this study, however, we have not explored this aspect further. The modifications of all the above methods, OLS, PCR, and cNMF, have been presented on the basis of the assumption that the errors corrupting the spectral and/or concentration measurements have identical variances (homoscedastic errors). If the errors are heteroscedastic, then it is possible to handle them by appropriate weighting of the data matrices. For simplicity, we consider the case when error variances vary along only one mode. For example, if the error variances vary only with respect to wavelength but do not vary with respect to samples, then the absorbances have to be scaled using the inverse of the standard deviations of errors (by postmultiplying the spectra data matrix with inverse of the square root of the error covariance matrix of absorbance measurements) before applying the modified PCR and cNMF methods. Similarly, if the concentration measurements of different species have different error variances which do not change with respect to samples, the concentration matrix can be scaled using the inverse of the square root of the error covariance matrix of concentration measurements in the modified cNMF method. The error variances have to be known or estimated from replicate measurements. The steps involved in these methods are given in the Supporting Information.

Figure 1. Pure species spectra used for generating the simulated data sets.

• Concentrations of the three species for 30 mixtures are generated by choosing random numbers from a uniform distribution in the interval [0.2, 0.3] for each species. This leads to a (30 × 3)-dimensional concentration matrix. • The (30 × 176)-dimensional noise-free spectral matrix is obtained by multiplying the concentration matrix and the pure species spectra matrix. The noisy spectral and concentration matrices are generated by adding random noise to the noise-free spectral and concentration matrices. Two different error covariance structures, namely, homoscedastic (equal error variances) and heteroscedastic (unequal error variances) are used to obtain two different data sets. For data set 1, the standard deviation of the error in absorbance is taken to be 1% of the maximum of the true absorbances among all mixtures. A (30 × 176)-dimensional error matrix is generated from a standard normal distribution and multiplied with this standard deviation. For data set 2, the standard deviation of error in absorbance measurement of a mixture at a particular wavelength is taken to be equal to 5% of corresponding noise-free absorbance. Then, a random number is generated from a standard normal distribution and multiplied with this standard deviation for each wavelength to obtain a (30 × 176)dimensional error matrix. The noisy spectral data sets are generated by adding the error matrices to the noise-free spectral matrix. The noisy concentration matrix is obtained by adding homoscedastic noise to the noise-free concentration data. The standard deviation of the error in concentration is taken to be 1% of the maximum of the concentrations among all mixtures. A (30 × 3)-dimensional matrix is generated from a standard normal distribution and multiplied with this standard deviation to obtain the error matrix. The noisy concentration matrix is obtained by adding the error matrix to the noise-free concentration matrix. Figure 2 shows the simulated data with heteroscedastic noise Two scenarios are considered for each data set. In the first scenario, it is assumed that all PSS are unknown, and the proposed methods are applied to recover the PSS. In the second scenario, it is assumed that an estimate of third PSS is



DATA SETS USED IN EVALUATION OF OLS, PCR, AND CNMF The three methods proposed in the previous sections are evaluated using two simulated data sets and one experimental data set consisting of mixtures of three species. A comparative evaluation of three methods is also performed. Simulated Data Sets. Simulated data sets consisting of three species mixtures are generated in the following manner. The noise free data sets are generated as follows: • Three pure species spectra are taken from the experimental data set obtained in ref 27 and are shown in Figure 1. These spectra are obtained between 13441

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Industrial & Engineering Chemistry Research

Article



RESULTS AND DISCUSSION The OLS, PCR, and cNMF methods are applied to the simulated and experimental data. The root-mean-square-error Table 1. Comparison of RMSE Values for the Simulated Data Sets species

OLS

PCR

cNMF

1 2 3 1 2 1 2 3 1 2

0.0317 0.0341 0.0387 0.0262 0.0265 0.0733 0.0783 0.0821 0.0626 0.0636

0.0321 0.0350 0.0392 0.0266 0.0271 0.0786 0.0801 0.0865 0.0651 0.0664

0.0316 0.0327 0.0380 0.0255 0.0258 0.0728 0.0669 0.0769 0.0569 0.0582

data set 1 scenario 1 data set 1 scenario 2 data set 2 scenario 1 data set 2 scenario 2

Figure 2. Simulated data set with heteroscedastic noise.

(RMSE) between estimated and reference PSS is calculated for comparing these methods as follows:

available, and the proposed methods are applied to recover the remaining two PSS from the spectral data. Experimental Data Set. The experimental data set contains the spectral data obtained from the mixture of three metal ions (Co(II), Cr(III), and Ni(II)) in 4% HNO3 solution obtained by.27 The data set contains the spectral data for 26 mixtures between the range of 300 nm between 650 nm at intervals of 2 nm. For each mixture, five replicates of its absorbance spectra have been taken. The standard deviations of error in the absorbance in each mixture at each wavelength can be estimated from the replicate measurements. The spectral data and standard deviations are shown in Figure 3. It can be observed from this figure that the noise standard deviations vary with respect to wavelength as well as with respect to mixtures, especially in the 300−350 nm and 600− 650 nm ranges, whereas the error standard deviations are almost the same in the 350−600 nm range. The proposed methods are applied under the assumption of homoscedastic errors as well as heteroscedastic errors (error variances varying with respect to wavelength), to assess the impact of taking error variances into account when they are available.

N

RMSE =

∑ (eiest − eiref )/N i=1

(27)

ref where eest i and ei are the estimated and reference pure species spectra at the ith wavelength, respectively. Further, the OLS and PCR methods are also compared using number of negative elements estimated. Comparison of Performance on Simulated Data Sets. The performance of OLS, PCR, and cNMF are presented on the two simulated data sets for both the scenarios. The RMSE values for all the cases are given in Table 1. A value of α = 500 was used to obtain the cNMF results. Table 1 indicates that cNMF performs better than the OLS and PCR methods for all scenarios. Surprisingly, OLS performs marginally better than the PCR-based method. The performance of all three methods is superior for homoscedastic errors in spectral data (Data set 1) in comparison to heteroscedastic errors in spectral data (Data set 2). This indicates that it may be possible to improve the accuracy of the methods by scaling the spectral

Figure 3. Experimental data for metal ion mixtures (UV data set). 13442

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Article

Industrial & Engineering Chemistry Research

Figure 4. Estimated pure species spectra for data set 2.

measurements using error standard deviations, provided that such information is available.28 For both data sets, the RMSE values of all methods is lesser for Scenario 1 as compared to corresponding values for Scenario 2. This indicates that knowledge of a subset of PSS helps to improve the estimates of the unknown PSS for all three methods. For data set 2, the estimated and the simulated PSS are shown in Figure 4 for all three methods. It can be observed from Figure 4 that the OLS and PCR methods yield negative estimates for PSS. The negative values of absorbance are not physically meaningful. Table 2 provides the number of negative estimates obtained by the OLS and PCR methods all the scenarios. Table 2 indicates that the performance of the methods deteriorate for heteroscedastic errors. Also, note that both methods estimated non-negative values for species 1 in all the scenarios. It can be seen from Figure 4 that the true values of absorbance for species 1 are far from zero (greater than 0.2) in comparison to the true values of absorbance for species 2 and 3. Hence, this observation indicates that the OLS and PCR methods should not be applied to data sets in which PSS contain regions with near zero absorbances. Comparison of Performance on Experimental Data. We present the results obtained by applying the proposed methods on the experimental data set described in the preceding section. Note that the experimental data set contains heteroscedastic errors which vary both with respect to wavelength and mixture. Since replicate measurements for all mixture spectra are available, the methods were applied to the averaged mixture spectra data set, as well as to two data sets

Table 2. Comparison of Negative Values in the Estimated PSS for the Simulated Data Sets data set 1 scenario 1 data set 1 scenario 2 data set 2 scenario 1 data set 2 scenario 2

species

OLS

PCR

1 2 3 1 2 1 2 3 1 2

0 13 4 0 7 0 23 12 0 21

0 14 4 0 8 0 26 14 0 23

Table 3. Comparison of RMSE Values of Normalized Spectra Obtained for the Experimental Data

averaged data

data set R1

data set R2

species

OLS

PCR

cNMF

NMF

Co Cr Ni Co Cr Ni Co Cr Ni

0.0225 0.0097 0.0267 0.0379 0.0188 0.0358 0.0334 0.0147 0.0277

0.0243 0.0105 0.0271 0.0885 0.0968 0.0324 0.0496 0.0416 0.0311

0.0239 0.0099 0.0260 0.0428 0.0243 0.0320 0.0344 0.0181 0.0258

0.0910 0.0544 0.0585 0.0875 0.0643 0.0752 0.0922 0.0682 0.0626

13443

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Article

Industrial & Engineering Chemistry Research

Figure 5. True and estimated pure species spectra for the averaged experimental data set.

constructed by randomly picking one mixture spectra from the five replicates of the experimental data set. A high value of α = 10 000 is used in the cNMF method because the mixture concentrations have very little error. The RMSE values for three data sets are given in Table 3. In addition to three proposed methods, the standard NMF method is also applied to the experimental data. Since the extracted PSS using standard NMF has a scale ambiguity, the estimated and experimentally measured PSS are normalized before computing the RMSE value. The permutation ambiguity in standard NMF is resolved by determining for each estimated spectra the maximally correlated measured PSS. The table shows that all three methods perform substantially better in estimating the PSS than the standard NMF. This is clearly due to the use of concentration information in the proposed methods. The performance of OLS and cNMF are comparable and better than that obtained using PCR. Although the estimated spectra for Co and Cr using OLS is marginally better than cNMF, the estimated spectra of Ni using cNMF is marginally better than that obtained using OLS. The estimated normalized PSS for the averaged mixture spectra data set are compared with the experimentally measured normalized PSS and shown in Figure 5. It can be seen from Figure 5 that all methods perform equally well in homoscedastic noise region (350−620 nm). However, the PSS estimates by all methods are poor for heteroscedastic noise region (300 nm -349 nm, and 620 nm -650 nm). Since the

Table 4. Comparison of Negative Values in the Estimated PSS for the Experimental Data

averaged data

data set R1

data set R2

species

OLS

PCR

Co Cr Ni Co Cr Ni Co Cr Ni

1 1 3 6 2 11 10 5 5

1 3 3 14 27 13 9 17 7

Table 5. Comparison of RMSE Values of Normalized Spectra Obtained for the Experimental Data When the Co Spectrum Is Known averaged data data set R1 data set R2

species

OLS

PCR

cNMF

Cr Ni Cr Ni Cr Ni

0.0110 0.0257 0.0203 0.0323 0.0137 0.0279

0.0113 0.0259 0.0850 0.1070 0.0332 0.0373

0.0115 0.0248 0.0262 0.0350 0.0177 0.0275

13444

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Article

Industrial & Engineering Chemistry Research

Figure 6. True and estimated pure species spectra after incorporating information on heteroscedastic error for the averaged experimental data set.

Figure 7. True and estimated pure species spectra after incorporating information on heteroscedastic error for the random set 1 experimental data set.

Figure 8. True and estimated pure species spectra after incorporating information on heteroscedastic error for the random set 2 experimental data set.

are not shown.) The number of negative values in the

measured concentrations contain little noise, all the methods perform equally well in homoscedastic region. Further, Figure 5 indicates that the OLS and PCR methods estimate negative values of absorbance in the heteroscedastic region. (This observation is valid for other two scenarios, however, the plots

estimated PSS for OLS and PCR are given in Table 4. The results show a significant reduction in the number of negative estimates by using averaged data. 13445

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Industrial & Engineering Chemistry Research



Table 6. Comparison of RMSE Values of Normalized Spectra Obtained by Applying Different Methods with the Information of Heteroscedastic Errors for the Experimental Data

averaged data

data set R1

data set R2

species

PCR

cNMF

Co Cr Ni Co Cr Ni Co Cr Ni

0.0225 0.0097 0.0267 0.0380 0.0189 0.0357 0.0335 0.0148 0.0276

0.0223 0.0095 0.0257 0.0322 0.0166 0.0289 0.0283 0.0116 0.0249

Article

ASSOCIATED CONTENT

S Supporting Information *

The Supporting Information is available free of charge on the ACS Publications website at DOI: 10.1021/acs.iecr.9b00437.



Derivation of eqs 25 and 26 and handling of heteroscedastic errors (PDF)

AUTHOR INFORMATION

Corresponding Authors

*Email: [email protected]. *Email: [email protected]. ORCID

Nirav Bhatt: 0000-0003-0928-6028 Shankar Narasimhan: 0000-0002-0558-0206 Notes

The proposed methods are also applied to the experimental data assuming that the PSS for Co is known. Table 5 shows the RMSE values of normalized estimated spectra of Ni and Cr. Comparing with the corresponding RMSE estimates shown in Table 3, it can be observed that significant improvement in the accuracy of estimating the unknown PSS is obtained for all three methods by incorporating the knowledge of one of the species spectra. In the experimental data, replicate measurements for mixture spectra is available. The replicate measurements can be used to obtain an estimate of the error standard deviations for each mixture corresponding to each wavelength. Since the dominant variation of error standard deviations (as shown in Figure 3b) is with respect to wavelength, we assume that the error variances vary only with respect to wavelength and not with respect to mixtures. The average of the standard deviations over all mixtures estimated from replicate measurements is used to obtain the variation of error standard deviation with respect to wavelength. The scaling procedure described in the Supporting Information is used to deal with heteroscedastic errors and the PSS are estimated using the OLS, PCR and cNMF methods. The estimated PSS using the OLS method are same as the one estimated without the scaling. The estimated PSS using the PCR and cNMF methods are shown in Figures 6, 7, and 8. The comparison of RMSE values of normalized spectra for the PCR and cNMF are reported in Table 6. By comparing Tables 3 and 6, it can be observed that the RMSE values for all the species decrease when the information on heteroscedastic errors is incorporated. The improvement in estimation accuracy is obtained by using the mean of the replicate measurements as well as by incorporating the information on error standard deviations obtained from the replicates.

The authors declare no competing financial interest.



ACKNOWLEDGMENTS The financial support to Nirav Bhatt from Department of Science & Technology, India through INSPIRE Faculty Fellowship is acknowledged.



REFERENCES

(1) Malinowski, E. R. Factor Analysis in Chemistry; John Wiley & Sons, 1991. (2) Brereton, R. G. Chemometrics: Data Analysis for the Laboratory and Chemical Plant; John Wiley & Sons, 2003. (3) de Juan, A.; Tauler, R. Multivariate curve resolution (MCR) from 2000: Progress in concepts and applications. Crit. Rev. Anal. Chem. 2006, 36, 163−176. (4) Tauler, R. Application of non-linear optimization methods to the estimation of multivariate curve resolution solutions and of their feasible band boundaries in the investigation of two chemical and environmental simulated data sets. Anal. Chim. Acta 2007, 595, 289− 298. (5) Cichocki, A.; Zdunek, R.; Amari, S. New algorithms for nonnegative matrix factorization in applications to blind source separation. IEEE International Conference on Acoustics, Speech and Signal Processing; IEEE, 2006. (6) Zdunek, R.; Cichocki, A. Nonegative matrix factorization with quadratic programming. Neurocompting 2008, 71, 2309−2320. (7) Tauler, R.; Kowalski, B. R. Multivariate curve resolution applied to spectral data from multiple runs of an industrial process. Anal. Chem. 1993, 65, 2040−2047. (8) Tauler, R. Multivariate curve resolution applied to second order data. Chemom. Intell. Lab. Syst. 1995, 30, 133−146. (9) Tauler, R.; Smilde, A.; Kowalski, B. R. Selectivity, local rank, three-way data analysis and ambiguity in multivariate curve resolution. J. Chemom. 1995, 9, 31−58. (10) Kawata, S.; Minami, K.; Minami, S. Superresolution of fourier transform spectroscopy data by the maximum entropy method. Appl. Opt. 1983, 22, 3593−3598. (11) Chew, W.; Widjaja, E.; Garland, M. Band-target entropy minimization (BTEM): An advanced method for recovering unknown pure component spectra. Application to the FTIR spectra of unstable organometallic mixtures. Organometallics 2002, 21, 1982−1990. (12) Widjaja, E.; Li, C.; Chew, W.; Garland, M. Band-target entropy minimization. A robust algorithm for pure component spectral recovery. Application to complex randomized mixtures of six components. Anal. Chem. 2003, 75, 4499−4507. (13) Kopriva, I.; Jerić, I. Multi-component analysis: blind extraction of pure components mass spectra using sparse component analysis. J. Mass Spectrom. 2009, 44, 1378−1388.



CONCLUSIONS In this work, we propose modifications of two popular multivariate calibration methods (OLS and PCR) and a source separation method (NMF) to incorporate known information on mixture concentrations and pure species spectra to extract unknown pure species spectra from mixture spectra data. Among the methods, the constrained NMF is recommended because it provides feasible and more accurate estimates. It is also shown that information on error standard deviations, if available from replicate measurements, can be used to improve the estimation accuracy of the methods. 13446

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447

Article

Industrial & Engineering Chemistry Research (14) Monakhova, Y. B.; Astakhov, S. A.; Kraskov, A.; Mushtakova, S. P. Independent components in spectroscopic analysis of complex mixtures. Chemom. Intell. Lab. Syst. 2010, 103, 108−115. (15) Gobinet, C.; Perrin, E.; Huez, R. Application of Non-negative Matrix Factorization to Fluorescence Spectroscopy. European Signal Processing Conference; Vienna, Austria; The European Association for Signal Processing, 2004; pp 1095−1098. (16) Moussaoui, S.; Brie, D.; Mohammad-Djafari, A.; Carteret, C. Separation of non-negative mixture of non-negative sources using a Bayesian approach and MCMC sampling. IEEE Trans. Signal Process. 2006, 54, 4133−4145. (17) Chen, J.; Arnold, M. A.; Small, G. W. Comparison of combination and first overtone spectral regions for near-infrared calibration models for glucose and other biomolecules in aqueous solutions. Anal. Chem. 2004, 76, 5405−5413. (18) Wishart, D. S.; Querengesser, L. M.; Lefebvre, B. A.; Epstein, N. A.; Greiner, R.; Newton, J. B. Magnetic resonance diagnostics: a new technology for high-throughput clinical diagnostics. Clin. Chem. 2001, 47, 1918−1921. (19) Ma, J.; Qi, J.; Gao, X.; Yan, C.; Zhang, T.; Tang, H.; Li, H. Study of Chemical Intermediates by Means of ATR-IR Spectroscopy and Hybrid Hard-and Soft-Modelling Multivariate Curve ResolutionAlternating Least Squares. J. Anal Methods Chem. 2017, 2017, 1. (20) Lee, D. D.; Seung, H. S. Algorithms for non-negative matrix factorization. Adv. Neural Inf. Process. Syst. 2001, 556−562. (21) Kim, J.; Park, H. Fast nonnegative matrix factorization: An active-set-like method and comparisons. SIAM J. Sci. Comput. 2011, 33, 3261−3281. (22) Rapin, J.; Bobin, J.; Larue, A.; Starck, J. Sparse and Nonnegative BSS for Noisy Data. IEEE Trans. Signal Process. 2013, 61, 5620−5632. (23) Donoho, D. L.; Stodden, V. C. When does non-negative matrix factorization give a correct decomposition into parts? Adv. Neural Inf. Process. Syst. 2003, 1141−1148. (24) Paatero, P. The multilinear engine−a table-driven, least squares program for solving multilinear problems, including the n-way parallel factor analysis model. J. Comput. Graph. Stat. 1999, 8, 854−888. (25) Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, U.K., 2004. (26) Lin, C. J. Projected gradient methods for nonnegative matrix factorization. Neural Computing 2007, 19, 2756−2779. (27) Wentzell, P. D.; Andrews, D. T. Maximum likelihood multivariate calibration. Anal. Chem. 1997, 69, 2299−2311. (28) Wentzell, P. D.; Andrews, D. T.; Hamilton, D. C.; Faber, K.; Kowalski, B. R. Maximum Likelihood Principal Component Analysis. J. Chemom. 1997, 11, 339−366.

13447

DOI: 10.1021/acs.iecr.9b00437 Ind. Eng. Chem. Res. 2019, 58, 13437−13447