Articles Anal. Chem. 1995, 67, 1497-1504

ChemNets: Theory and Application

Zhi Wang,†,§ Jenq-Neng Hwang,‡ and Bruce R. Kowalski*,†

Laboratory for Chemometrics, Center for Process Analytical Chemistry, Department of Chemistry BG-10, and Department of Electrical Engineering, University of Washington, Seattle, Washington 98195

ChemNets are introduced for certain types of applications by taking advantage of previously developed chemical theories and incorporating them into neural network structures. Such a priori knowledge may help in building a model that is one step closer to the true underlying model than a model constructed by a standard neural network. A robust and parsimonious neural structure with good predictive ability is expected for ChemNets. In this paper, the theory of ChemNets is presented. A ChemNet is designed for Taguchi sensors and compared to the standard neural network. The ChemNet constructed on the basis of Taguchi sensor theory showed a significant advantage in terms of parsimony, requiring fewer model parameters than the corresponding neural networks.

With the increasing demand for nonlinear data analysis, the number of nonlinear multivariate methods introduced into the chemical literature has increased. These methods include multiplicative scattering correction (MSC),1,2 nonlinear principal component regression (NPCR),3 nonlinear partial least-squares regression (NPLS),4,5 locally weighted regression (LWR),6-8 projection pursuit regression (PPR),9,10 alternating conditional expectations (ACE),11,12 and multivariate adaptive regression

† Department of Chemistry.
‡ Department of Electrical Engineering.
§ Present address: Ohmeda Inc., The BOC Group, 1315 W. Century Dr., Louisville, CO 80027.

(1) Geladi, P.; McDougall, D.; Martens, H. Appl. Spectrosc. 1985, 39, 491.
(2) Isaksson, T.; Naes, T. Appl. Spectrosc. 1988, 42, 1273.
(3) Vogt, N. B. Chemom. Intell. Lab. Syst. 1989, 7, 119.
(4) Wold, S.; Kettaneh-Wold, N.; Skagerberg, B. Chemom. Intell. Lab. Syst. 1989, 7, 53.
(5) Wold, S. Chemom. Intell. Lab. Syst. 1992, 14, 71.
(6) Naes, T.; Isaksson, T.; Kowalski, B. R. Anal. Chem. 1990, 62, 668.
(7) Naes, T.; Isaksson, T. Appl. Spectrosc. 1992, 46, 34.
(8) Wang, Z.; Isaksson, T.; Kowalski, B. R. Anal. Chem. 1994, 66, 249.
(9) Friedman, J. H.; Stuetzle, W. J. Am. Stat. Assoc. 1981, 76, 817.
(10) Beebe, K. R.; Kowalski, B. R. Anal. Chem. 1988, 60, 2273.
(11) Breiman, L.; Friedman, J. H. J. Am. Stat. Assoc. 1985, 80, 580.

splines (MARS).13,14 Recently, researchers have applied neural networks (NNs) in analytical chemistry.15,16 Sekulic et al.12 compared the performance of several nonlinear methods, and NNs were found to provide better results in most cases. Naes et al.17 tested NNs on two near-infrared spectroscopic data sets. In both cases, the NNs produced better results than linear methods. Gemperline et al.16 found that orthogonal transformation of the response variables can significantly improve the training speed and the overall precision obtained from NNs as compared to other nonlinear methods. Blank and Brown18 compared NNs with linear and nonlinear PCR and PLS methods in modeling simulated and real spectroscopic data representing a range of different nonlinearities. Their results were consistent with the conclusion that NNs offer increased modeling power as compared to other nonlinear methods. Goodacre et al.19 analyzed pyrolysis mass spectra to obtain quantitative information representative of the complex components of mixtures using PCR, PLS, and NNs. NNs were found to produce the most accurate predictions.

However, research has also found that the high level of nonlinear modeling performance of neural networks has an associated cost. It was demonstrated both by Blank and Brown18 and by Pollard et al.20 that NNs require a greater sampling frequency (more training samples) in order to decrease the effect of any single sample (including outliers) upon prediction and to decrease the variance of the estimated mean value of the test samples. Another drawback of NNs is the extremely long training time. This might be due to an improper neural network structure (number of

(12) Sekulic, S.; Seasholtz, M. B.; Wang, Z.; Kowalski, B. R.; Lee, S.; Holt, B. Anal. Chem. 1993, 65, 835A.
(13) Friedman, J. Ann. Stat. 1991, 19, 1.
(14) Sekulic, S.; Kowalski, B. R. J. Chemom. 1992, 6, 199.
(15) Long, J. R.; Gregoriou, V. G.; Gemperline, P. J. Anal. Chem. 1990, 62, 1791.
(16) Gemperline, P. J.; Long, J. R.; Gregoriou, V. G. Anal. Chem. 1991, 63, 2313.
(17) Naes, T.; Kvaal, K.; Isaksson, T.; Miller, C. J. Near Infrared Spectrosc. 1993, 1, 1.

(18) Blank, T. B.; Brown, S. D. Anal. Chem. 1993, 65, 3081.
(19) Goodacre, R.; Neal, M. J.; Kell, D. B. Anal. Chem. 1994, 66, 1070.
(20) Pollard, J. F.; Broussard, M. R.; Garrison, D. B.; San, K. Y. Comput. Chem. Eng. 1992, 16, 253.


hidden neurons, for example) or basis function. The basis function most commonly adopted by NN researchers is the sigmoid function, which is a monotonic, nondecreasing, differentiable function with a very simple first derivative form that possesses properties conducive to neural computation.21 However, it does not interpolate or extrapolate efficiently in a wide variety of regression applications.21 Several attempts have been made to improve the choice of nonlinear basis functions in NNs, e.g., linear,12,16,22 Gaussian,23-26 and semiparametric (nonfixed) functions.21 Even though these approaches have certain advantages over standard NNs, the choice of nonlinear basis functions is still ad hoc. Thus far, little work has focused on the selection of proper built-in functional forms for neural network structures. This can largely be attributed to the difficulty of predicting the correct nonlinear behavior within the system under study.

In the field of chemistry, many theories have been developed that describe chemical systems or phenomena. In this paper, the concept of chemical neural networks, or ChemNets, is introduced for certain types of applications by taking advantage of previously developed chemical theories and incorporating them into neural network structures. Such a priori knowledge may help in building a model that is one step closer to the true underlying model than a model constructed by a NN. A robust and parsimonious neural structure with good predictive ability is expected for ChemNets. In this paper, the theory of ChemNets is presented, and a ChemNet designed for Taguchi sensors is discussed. The performance of this ChemNet is compared to that of a standard back-propagation neural network.
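To make the preceding point concrete, the sigmoid basis function and its first derivative can be written in a few lines. The sketch below is illustrative only (plain NumPy, not code from this work); it simply shows the property that the derivative is expressible in terms of the function value itself.

```python
import numpy as np

def sigmoid(z):
    """Sigmoid basis function g(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    """First derivative g'(z) = g(z) * (1 - g(z)), the simple form noted in the text."""
    g = sigmoid(z)
    return g * (1.0 - g)

z = np.linspace(-6.0, 6.0, 5)
print(sigmoid(z))
print(sigmoid_derivative(z))
```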

Figure 1. Points $(\mathbf{x}_l, y_l)$, $l = 1, 2, \ldots, N$, belonging to a function $f$: $f(\mathbf{x}_l) = y_l$.


MATHEMATICAL FORMULATION

Most NNs estimate unknown functions from a set of correct input-output pairs, called training examples (or calibration samples in chemometrics). NNs have been proposed for regression problems with no a priori assumptions concerning the unknown functions other than to impose a certain degree of smoothness. A regression problem can be formalized as a problem of approximating a multivariate function from a set of training examples, S, where

$$S = \{\mathbf{y}_l, \mathbf{x}_l\} = \{y_{l1}, y_{l2}, \ldots, y_{lq};\; x_{l1}, x_{l2}, \ldots, x_{lr}\} \in Y \otimes X$$

which have been generated from unknown functions,

$$y_{li} = f_i(\mathbf{x}_l) + \epsilon_{li}, \qquad l = 1, 2, \ldots, N;\quad i = 1, 2, \ldots, q$$

where the r-dimensional column vector x_l contains the independent variables, e.g., digitized intensities from the near-infrared spectrum of sample l; y_li is a scalar dependent variable, e.g., the concentration of the ith chemical component in sample l; and f_i is an unknown function mapping from r-dimensional Euclidean space to one-dimensional Euclidean space, as shown in Figure 1,

$$f_i: X \to Y_i, \qquad i = 1, 2, \ldots, q$$

and ε_li is a random variable with zero mean, E[ε_li] = 0, and independent of x_l. Often ε_li is assumed to be independent and identically distributed (iid) as well. The goal of regression is to construct estimates, $\hat{f}_1, \hat{f}_2, \ldots, \hat{f}_q$, which are functions of the data {y_l, x_l}, l = 1, 2, ..., N, that best approximate the unknown functions, f_1, f_2, ..., f_q, and then to use these estimates to predict a new y given a new x:

$$\hat{y}_i = \hat{f}_i(\mathbf{x}), \qquad i = 1, 2, \ldots, q$$

Several techniques have been developed in the fields of statistics and approximation theory to find solutions. A common technique involves approximating the function f_i by a parametric function F_i(p_i, x) that has a fixed number of parameters, p_i, which belong to some set P. For a specific choice of F_i(p_i, x), the problem is then to find the set of parameters, p_i, that provides the best possible approximation of f_i over the set of examples S. This is commonly done by finding those parameters that minimize the least-squares error on the data set, which is equivalent to solving the following minimization problem:

$$\min_{\mathbf{p}_i} \sum_{l=1}^{N} [y_{li} - F_i(\mathbf{p}_i, \mathbf{x}_l)]^2 \qquad (1)$$
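As an illustrative sketch of the minimization in eq 1, the snippet below fits a hypothetical parametric function F(p, x) by least squares using scipy.optimize.least_squares. The functional form and the synthetic data are assumptions made only for demonstration.

```python
import numpy as np
from scipy.optimize import least_squares

def F(p, x):
    # Hypothetical parametric form F(p, x); any differentiable choice could be used here.
    return p[0] + p[1] * np.exp(-p[2] * x)

# Synthetic training examples {y_l, x_l}, l = 1..N (illustrative only).
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 50)
y = F(np.array([0.5, 2.0, 1.3]), x) + 0.05 * rng.standard_normal(x.size)

# Eq 1: find p minimizing sum over l of [y_l - F(p, x_l)]^2.
fit = least_squares(lambda p: y - F(p, x), x0=np.ones(3))
print("estimated parameters:", fit.x)
```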

(21) Hwang, J. N.; Lay, S. R.; Maechler, M.; Martin, D.; Schimert, J. IEEE Trans. Neural Networks 1994, 5 (3), 342.
(22) Borggaard, C.; Thodberg, H. H. Anal. Chem. 1992, 64, 545.
(23) Moody, J.; Darken, C. J. Neural Comput. 1989, 1, 281.
(24) Kosko, B. Neural Networks and Fuzzy Systems: A Dynamical Systems Approach to Machine Intelligence; Prentice Hall: Englewood Cliffs, NJ, 1992.
(25) Park, J.; Sandberg, I. W. Neural Comput. 1991, 3, 246.
(26) Lee, Y. Neural Comput. 1991, 3, 440.


Later, it will be shown that the F_i(p_i, x) chosen by standard NNs consists of sigmoid bases. The functions F_i(p_i, x) chosen by the ChemNet for analyzing Taguchi sensors consist of several bases, including sigmoids.

NEURAL NETWORKS

Feedforward multilayer (perceptron) NNs have one or more layers of hidden neurons between the input and output layers. Several recent results27-29 have shown that a two-layer (one output and one hidden layer) neural network with sigmoidal neurons can represent arbitrary continuous functions to any desired accuracy if enough hidden neurons are used.24 A two-layer (one hidden layer with m hidden neurons) neural network, as shown in Figure 2, can be mathematically formulated as follows:

$$O_i(\mathbf{x}) = g_i\!\left(w_{i0} + \sum_{k=1}^{m} \beta_{ik}\, g_k\!\left(w_{k0} + \sum_{j=1}^{r} w_{kj} x_j\right)\right), \qquad i = 1, 2, \ldots, q$$

where w_i0 and w_k0 denote the biases of the ith neuron in the output layer and the kth neuron in the hidden layer, respectively; w_kj denotes the hidden-layer weight linking the kth hidden neuron and the jth neuron of the input layer (or the jth element of the input vector x); β_ik denotes the output-layer weight linking the ith output neuron and the kth hidden neuron; and g_i

(27) Cybenko, G. Approximation by Superpositions of a Sigmoidal Function; Technical Report 856; Department of Electrical and Computer Engineering, University of Illinois, 1988.
(28) Hornik, K.; Stinchcombe, M.; White, H. Neural Networks 1989, 2, 359.
(29) White, H. Neural Networks 1990, 3, 535.


and g_k are the nonlinear basis functions, which are usually assumed to be a fixed sigmoid mapping function, g(z) = 1/(1 + e^{-z}).

Figure 2. Two-layer neural network with fixed sigmoid functions in the neurons of the hidden and output layers.

The above formulation defines explicitly the parametric representation of the functions that are used to approximate {f_i(x), i = 1, 2, ..., q}. Specifically,

$$F_i(\mathbf{p}_i, \mathbf{x}) = O_i(\mathbf{x})$$

Note that the estimate $\hat{O}_i(\mathbf{x})$ can be obtained by minimizing the sum of squared errors given in eq 1; therefore, $F_i(\hat{\mathbf{p}}_i, \mathbf{x}) = \hat{f}_i(\mathbf{x})$, with p_i = {w_i0, w_k0, w_kj, β_ik}.

In neural network terminology, eq 1 is often referred to as the energy function, namely,

$$E = \frac{1}{2} \sum_{l=1}^{N} \sum_{i=1}^{q} [y_{li} - O_i(\mathbf{x}_l)]^2 \qquad (2)$$
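The two-layer formulation and the energy function of eq 2 translate directly into code. The sketch below is a plain NumPy illustration with arbitrary weight shapes (r = 7 inputs, m = 3 hidden neurons, q = 2 outputs), not the implementation used in this work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_layer_net(x, W_hid, b_hid, B_out, b_out):
    """Forward pass: O_i(x) = g(b_out_i + sum_k B_out[i,k] * g(b_hid_k + sum_j W_hid[k,j] * x_j))."""
    hidden = sigmoid(W_hid @ x + b_hid)      # m hidden neurons
    return sigmoid(B_out @ hidden + b_out)   # q outputs

def energy(Y, X, params):
    """Eq 2: E = 1/2 * sum over samples and outputs of squared errors."""
    W_hid, b_hid, B_out, b_out = params
    residuals = [Y[l] - two_layer_net(X[l], W_hid, b_hid, B_out, b_out) for l in range(len(X))]
    return 0.5 * sum(float(np.sum(r ** 2)) for r in residuals)

# Shapes chosen only for illustration: r = 7, m = 3, q = 2, N = 5 samples.
rng = np.random.default_rng(1)
params = (rng.normal(size=(3, 7)), rng.normal(size=3),
          rng.normal(size=(2, 3)), rng.normal(size=2))
X = rng.uniform(size=(5, 7))
Y = rng.uniform(size=(5, 2))
print("E =", energy(Y, X, params))
```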

Back-Propagation Learning. The training of a feedforward multilayer NN uses back-propagation (BP) learning,30 a simple iterative gradient descent algorithm designed to minimize the energy function given in eq 2. There are two common types of back-propagation: batch and sequential. Batch BP updates the weights after the presentation of the complete set of training data (or input-output pairs); hence, a training iteration incorporates one sweep through all the training pairs. Sequential BP, on the other hand, adjusts the weights as each training pair is presented rather than after a complete pass through the training data. In addition, the training pairs should be presented in random order at each pass. This makes the path through weight space stochastic, allowing wider exploration of the energy surface and helping to avoid the local minima that are frequently encountered by optimization techniques.31 The relative effectiveness of the two approaches depends on the problem, but the sequential approach has proven more effective in most cases.32,33 Therefore, in this study, the sequential approach is adopted. At each training iteration, the weights are adjusted according to the gradient descent rule,

$$\Delta w = -\eta\, \frac{\partial E}{\partial w}$$

where Δw is the adjustment applied to a weight w and η is a constant parameter often referred to as the learning rate.
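A sketch of the sequential updating described above is given below: training pairs are presented one at a time in random order, and each weight is moved a step of size η down the gradient. For brevity the gradient is approximated numerically here; an actual BP implementation would use the analytic chain-rule derivatives. The toy linear model and all numerical settings are assumptions made only for illustration.

```python
import numpy as np

def numerical_grad(loss, w, eps=1e-6):
    """Finite-difference gradient of loss(w); stands in for the analytic BP gradient."""
    g = np.zeros_like(w)
    for i in range(w.size):
        d = np.zeros_like(w)
        d[i] = eps
        g[i] = (loss(w + d) - loss(w - d)) / (2.0 * eps)
    return g

def sequential_epoch(w, pairs, loss_for_pair, eta=0.05, rng=None):
    """One pass: present training pairs in random order, update the weights after each pair."""
    rng = rng or np.random.default_rng()
    for idx in rng.permutation(len(pairs)):
        x_l, y_l = pairs[idx]
        w = w - eta * numerical_grad(lambda v: loss_for_pair(v, x_l, y_l), w)
    return w

# Toy example: fit y = w0 + w1 * x by sequential gradient descent (illustrative only).
pairs = [(x, 1.0 + 2.0 * x) for x in np.linspace(0.0, 1.0, 20)]
loss = lambda w, x, y: 0.5 * (y - (w[0] + w[1] * x)) ** 2
w = np.zeros(2)
for _ in range(200):
    w = sequential_epoch(w, pairs, loss)
print("fitted weights:", w)
```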

Variations of BP Neural Networks. Naes et al.17 and Hwang et al.21 gave an in-depth discussion of the strong relationship between BP NNs and other statistical methods. Nevertheless, the model parameters in the statistical methods are determined in a restricted manner. The BP algorithms, on the other hand, are iterative procedures concentrating on fitting with few restrictions, which results in slow and inefficient training. This is due in part to the built-in sigmoid functions and is a strong indication that the sigmoid function may not be a proper basis function for fitting certain functions, including linear functions or very complex nonlinear functions.

Several attempts have been made to improve the choice of neural network basis functions. One modification was suggested by Borggaard and Thodberg22 and Sekulic et al.,12 who added direct linear connections between the input and output layers, so that the network is composed of a linear part and a standard nonlinear part. Such networks are referred to as direct linear feedthrough (DLF) networks and have been shown to perform better than standard BP NNs when linearities are present in the data. Gemperline et al.16 incorporated different linear and nonlinear functions into the hidden neurons to facilitate training. The functions tested included a linear function, g(z) = z; a quadratic function, g(z) = z^2; a sigmoid function; and a hyperbolic tangent function, g(z) = tanh(z). A trial-and-error approach was used to select the number of hidden neurons and the functions used in the hidden neurons. The most exciting aspect of this approach is that the weights can be automatically adjusted by the training process to accommodate linear responses and different types of nonlinear responses as they occur in different spectral regions.
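To make the direct linear feedthrough idea concrete, the sketch below adds a direct linear input-to-output connection alongside a small sigmoid hidden layer. The weight shapes and values are arbitrary illustrations, not taken from the cited work.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dlf_forward(x, W_lin, W_hid, b_hid, B_out, b_out):
    """Direct linear feedthrough: output = linear part + standard sigmoid-hidden-layer part."""
    linear_part = W_lin @ x                                   # direct input-to-output connections
    nonlinear_part = B_out @ sigmoid(W_hid @ x + b_hid) + b_out
    return linear_part + nonlinear_part

rng = np.random.default_rng(2)
x = rng.uniform(size=7)                                       # e.g., 7 sensor or spectral channels
out = dlf_forward(x,
                  W_lin=rng.normal(size=(2, 7)),
                  W_hid=rng.normal(size=(3, 7)), b_hid=rng.normal(size=3),
                  B_out=rng.normal(size=(2, 3)), b_out=rng.normal(size=2))
print(out)
```

When the data are dominated by linear behavior, most of the fitting burden falls on the direct connections, and the sigmoid part is left to model the residual nonlinearity.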


Figure 3. Two-layer ChemNet. The g functions are from chemical theories.

Recently, more research23-26 has focused on investigating multivariate Gaussian or radial basis function (RBF) networks because of their ability to model nonlinear functions. The training algorithm23 for RBF networks adjusts the RBF centers and the RBF widths and heights according to the gradient descent rule. All the results showed that RBF networks were much faster in training than the standard NNs. However, as shown by Hartman and Keeler,34 the time advantage diminishes in high-dimensional input spaces. In certain cases, RBF networks produced more accurate results.

Hwang et al.21 proposed a projection pursuit network (PPN) by combining NNs with projection pursuit regression. Both NNs and PPNs are based on projections of the data in directions determined from the interconnection weights. However, unlike the use of fixed nonlinear sigmoid functions for the hidden neurons in a neural network, a PPN systematically approximates the unknown nonlinear function through a supersmoother algorithm and orthogonal polynomials. Simulations have shown that NNs and PPNs have quite comparable training speed and achieve comparable accuracy for test data, but PPNs are considerably more parsimonious in that fewer neurons are required.

The above methods attempt to approximate unknown functions by incorporating various basis functions other than the commonly used sigmoids into NNs. Interestingly, all showed certain advantages. However, like the selection of the sigmoid function, the selection of these basis functions is still ad hoc. Such networks are rarely the optimal models for estimating the true underlying unknown functions.

(30) Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. In Parallel Distributed Processing: Explorations in the Microstructure of Cognition; Rumelhart, D. E., McClelland, J. L., Eds.; MIT Press: Cambridge, MA, 1986; Vols. I and II.
(31) Hertz, J.; Krogh, A.; Palmer, R. G. Introduction to the Theory of Neural Computation; Addison-Wesley: Redwood City, CA, 1991.
(32) Robbins, H.; Monro, S. Ann. Math. Stat. 1951, 22, 400.
(33) White, H. Neural Comput. 1989, 1, 425.
(34) Hartman, E.; Keeler, J. D. Neural Comput. 1991, 3, 566.

CHEMNETS

Recent research involving the parsimony principle reveals another problem of NNs. Seasholtz and Kowalski35 studied the parsimony principle formally in order to understand under what circumstances the various multivariate methods are appropriate. Considerable in-depth theoretical work proved the parsimony principle, which states, "If two models in some way adequately model a given set of data, the one that is described by a fewer number of parameters will have better predictive ability given new data." Similar research done on NNs by Baum and Haussler36 also showed that "the neural networks with optimal generalization ability are characterized by having the fewest weights while still processing the training data correctly." Several recent studies have shown that some multivariate statistical methods perform as well as NNs with simpler models, namely, fewer parameters in the models. This indicates that NNs are not optimal.

With information currently being generated at an explosive pace, the development of data modeling methods is under active research. However, all research efforts face a mathematical dilemma that was summarized by Dr. Albert Einstein37 in 1922: "As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality."

In mathematical modeling, there are basically two approaches, theoretical and empirical. As for the theoretical approach, many

(35) Seasholtz, M. B.; Kowalski, B. R. Anal. Chim. Acta 1993, 277, 165.
(36) Baum, E. B.; Haussler, D. Neural Comput. 1989, 1, 151.
(37) Einstein, A. Sidelights on Relativity; E. P. Dutton: New York, 1923.


theories have been developed in chemistry to describe chemical systems and various phenomena. These theories are represented by sets of mathematical equations. Since the theories were developed under simplified assumptions, they cannot account for complex chemical systems. Therefore, they are not always useful in real applications. Often, data modeling has to resort to an empirical approach, such as multivariate statistical methods or NNs. Overfitting is the biggest problem for empirical approaches. The models constructed by these methods have to be validated with a large test (prediction) data set. To make things worse, no completely reliable validation statistic for prediction accuracy has yet been developed.

Facing this dilemma, an optimum method is envisioned that takes advantage of the positive merits of both the theoretical and the empirical modeling approaches. This is the motivation behind the design of ChemNets. As pointed out by Poggio and Girosi,38 for any approximation scheme it is very important to choose an approximating function F(p, x), as shown in eq 1, that can represent f(x) as closely as possible. There would be little point in trying to learn if the optimal approximation function F(p, x) could give only a very poor representation of f(x). Pollard et al.20 conducted an extensive study comparing NNs with several theoretical approaches based on simplified assumptions. In conclusion, Pollard et al. stated, "It would be a significant accomplishment to develop neural network paradigms which allow the user to incorporate arbitrary a priori information."

The idea behind ChemNets is to incorporate chemical theories into neural network structures, as shown in Figure 3. The structure of a ChemNet is similar to the NN structure shown in Figure 2. The functions g_i embedded in the structure are most commonly sigmoidal but can be linear or nonlinear, depending on the nature of the applied chemical theory. The sigmoid functions may be retained in the structure to model some unknown functional relationships that are not accounted for by theory but are present in real data. As an example, a DLF12,22 network can be viewed as a type of ChemNet. The linear part of the network complies with the Beer-Lambert law in chemistry. The nonlinear part containing the sigmoid functions models nonlinearities that the Beer-Lambert law cannot handle, as well as uncertainties often associated with real data.

ChemNets are designed to achieve an optimal net structure with a minimum number of net parameters, including weights. This can be achieved by incorporating chemical theory into the neural net structure. Even though the chemical theory may not be completely accurate, functional forms built on this basis are still much closer to the "true" functions than arbitrary structures built upon sigmoids. Therefore, chemical theory can offer more efficient estimates of the unknown functions. One drawback of ChemNets is that they cannot be as general as NNs; they have to be designed for very specific applications. In the next section, the design of a ChemNet for analyzing Taguchi sensor array data is discussed.

CHEMNET FOR TAGUCHI SENSORS

The metal oxide gas sensor is a unique and important type of chemical sensor. The pioneering work of using sintered tin oxide-based sensors to detect oxidizable reducing agents was carried out independently by Taguchi39 and Seiyama et al.40 The property measured is the resistance of the tin oxide, which decreases when the sensor is exposed to a reducing gas. In the area of chemical gas sensors, one of the new developments since the mid-1980s has been sensor array instrumentation.41-43 In many real-world problems, samples are mixtures of several components. The difficulty of multicomponent monitoring and analysis arises from the fact that most gas sensors are only partially selective. Sensor array instrumentation overcomes this nonselectivity difficulty by using multivariate calibration methods.

The sensing mechanisms of Taguchi sensors, due to both the physics of the semiconductor materials and the interfacial electrochemistry, are nonlinear functions of a number of device operation parameters. The relationship that describes the sensor signals as a function of several gas concentrations can be derived from the chemical and physical models introduced by Clifford44 and Morrison.45 In a modified notation,46 the ith sensor response X_i can be expressed as a function of the J gas concentrations {Y_j} by

$$X_i = R_i \left(1 + \sum_{j=1}^{J} A_{ij}\, Y_j^{\,m_{ij}}\right)^{-B_i} \qquad (3)$$

where R_i denotes the sensor resistance in air, A_ij and m_ij are gas- and sensor-specific parameters, and B_i is specific to the sensor element.

Equation 3 is a typical classical model in chemometrics,47-49 namely, one in which the sensor response is a function of the gas concentrations. A major disadvantage of classical models is that the concentrations of all components present, including interferences, have to be known before the equation can be used. This is why the inverse model50,51 is preferred in chemometrics, namely, one in which the concentration of one or more analytes is modeled as a function of the instrument measurements (sensor responses, for example). The advantage of the inverse model is that calibration for the concentrations of analytes can still be carried out even in the presence of unknown interferences. However, on closely examining eq 3, no straightforward inverse function is available. In other words, the concentration Y_j cannot be easily expressed as a function of the sensor responses {X_i}, namely,

$$Y_j \neq g_j(X_1, \ldots, X_I), \qquad j = 1, \ldots, J$$

Figure 4. ChemNet for Taguchi gas sensors.

(38) Poggio, T. A.; Girosi, F. Exploring Brain Functions: Models in Neuroscience; John Wiley & Sons: New York, 1993; pp 78-96.
(39) Taguchi, R. Japanese Patent 4538200, 1962.
(40) Seiyama, T.; Kato, A.; Fujiishi, K.; Nagatani, M. Anal. Chem. 1962, 34, 1502.
(41) Carey, W. P.; Beebe, K. R.; Sanchez, E.; Geladi, P.; Kowalski, B. R. Sens. Actuators 1986, 9, 223.
(42) Stetter, J. R.; Jurs, P. C.; Rose, S. L. Anal. Chem. 1986, 58, 860.
(43) Carey, W. P.; Beebe, K. R.; Kowalski, B. R. Anal. Chem. 1987, 59, 1529.
(44) Clifford, P. K. Homogeneous Semiconducting Gas Sensors: A Comprehensive Model. Anal. Chem. Symp. Ser. 1983, No. 17, 135.
(45) Morrison, R. Sens. Actuators 1987, 11, 283.
(46) Horner, G.; Hierold, C. Sens. Actuators 1990, B2, 173.
(47) Haaland, D. M.; Easterling, R. G. Appl. Spectrosc. 1982, 36, 665.
(48) Haaland, D. M.; Thomas, E. V. Anal. Chem. 1988, 60, 1193.
(49) Neter, J.; Wasserman, W.; Kutner, M. H. Applied Linear Regression Models, 2nd ed.; Irwin: Boston, MA, 1989.
(50) Sanchez, E.; Kowalski, B. R. J. Chemom. 1988, 2, 247.
(51) Sanchez, E.; Kowalski, B. R. J. Chemom. 1988, 2, 265.
(52) Carey, W. P.; Yee, S. Sens. Actuators 1992, 9, 113.
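For illustration, the classical (forward) model of eq 3, as reconstructed above, can be simulated directly. All parameter values in the sketch are arbitrary assumptions chosen only to show the qualitative behavior (resistance decreasing with concentration), not fitted sensor constants.

```python
import numpy as np

def sensor_response(Y, R, A, m, B):
    """Eq 3 (as reconstructed above): X_i = R_i * (1 + sum_j A_ij * Y_j**m_ij)**(-B_i).

    Y : gas concentrations, shape (J,)
    R : baseline resistances in air, shape (I,)
    A, m : gas- and sensor-specific parameters, shape (I, J)
    B : sensor-specific exponents, shape (I,)
    """
    inner = 1.0 + (A * Y[np.newaxis, :] ** m).sum(axis=1)
    return R * inner ** (-B)

# Two gases, three sensors; illustrative parameter values only.
R = np.array([100.0, 80.0, 120.0])
A = np.array([[0.02, 0.01], [0.015, 0.03], [0.01, 0.02]])
m = np.array([[0.5, 0.6], [0.7, 0.5], [0.6, 0.6]])
B = np.array([0.8, 1.0, 0.9])

for Y in (np.array([5.0, 5.0]), np.array([100.0, 50.0]), np.array([500.0, 500.0])):
    print(Y, "->", sensor_response(Y, R, A, m, B))
```

Solving this expression for the concentrations is not straightforward, which motivates the simplification developed next.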

However, if the parameters {m_ij} are set to be only gas specific, namely, m_ij = m_j, then eq 3 can be simplified as

$$X_i = R_i \left(1 + \sum_{j=1}^{J} A_{ij}\, Y_j^{\,m_j}\right)^{-B_i} \qquad (4)$$

Rearranging eq 4 gives

$$\left(\frac{X_i}{R_i}\right)^{-1/B_i} = 1 + \sum_{j=1}^{J} A_{ij}\, Y_j^{\,m_j}$$

Since R_i and B_i are parameters to be determined, the above equation is equivalent to the following equation as long as R_i and B_i are not zero:

$$(R_i X_i)^{-B_i} = 1 + \sum_{j=1}^{J} A_{ij}\, Y_j^{\,m_j} \qquad (5)$$

As shown in Appendix 1, the following equation,

$$Y_j = \left(\sum_{i=1}^{I} W_{ji}\,(R_i X_i)^{-B_i} + W_{j0}\right)^{1/m_j} \qquad (6)$$

is equivalent to eq 5. Based on eq 6, a ChemNet structure can be designed as shown in Figure 4. The ChemNet is a four-layer structure that consists of two major parts. The left part (referred to as the theoretical part in the following) is based on sensor theory. The function g_i in the first hidden layer is taken from eq 6, that is, g_i(h_i) = h_i^(-B_i), where h_i is the input to hidden neuron i. The node with constant input 1 (denoted a) is used to introduce the bias term W_j0 shown in eq 6. Note that, as shown in Appendix 1, W_j0 is constrained to be $-\sum_{i=1}^{I} W_{ji}$, while in the ChemNet this constraint is relaxed; more specifically, W_j0 is adjusted by the optimization procedure during the training process. The function g_j in the second hidden layer is also taken from eq 6, that is, g_j(H_j) = H_j^(1/m_j), where H_j is the net input to hidden neuron j. The right-hand part of Figure 4 is a standard neural network with sigmoid functions. As mentioned earlier, this part is introduced to model the residual variance that cannot be handled by sensor theory in real situations. The upper (output) layer is deliberately designed to combine the theoretical part and the neural net part. The network output can be obtained by

$$O_j = t_j T_j + n_j N_j \qquad (7)$$

where T_j and N_j are the outputs from the theory and neural net parts, respectively, and are weighted by t_j and n_j before they contribute to the jth final output O_j. T_j can be obtained from eq 6. After standard BP training of all the weight parameters (R_i, B_i, W_ji, m_j) of the ChemNet shown in Figure 4, the interpretation of the weights t_j and n_j in the output layer can be performed. In other words, the relative magnitudes of the contributions from the


theoretical part and the neural net part to the final results can be determined. In order to do this, both T_j and N_j are scaled between 0 and 1 so that the minimum and maximum contributions from both parts lie between 0 and 1. The scaling, without changing the final output, can be obtained by changing eq 7 to

$$O_j = t_j (T_j^{\max} - T_j^{\min})\,\bar{T}_j + n_j (N_j^{\max} - N_j^{\min})\,\bar{N}_j + t_j T_j^{\min} + n_j N_j^{\min}$$

where T_j^max and T_j^min are the maximum and minimum theory contributions, respectively, for output j, while N_j^max and N_j^min are the maximum and minimum neural net contributions, respectively, for output j. The scaled contribution can be reformulated as

$$O_j = \bar{t}_j \bar{T}_j + \bar{n}_j \bar{N}_j + \theta_j \qquad (8)$$

where

$$\theta_j = t_j T_j^{\min} + n_j N_j^{\min}$$

$\bar{T}_j$ and $\bar{N}_j$ are the scaled outputs from the theoretical and neural net parts, respectively (e.g., $\bar{T}_j = (T_j - T_j^{\min})/(T_j^{\max} - T_j^{\min})$), and $\bar{t}_j$ and $\bar{n}_j$ are the scaled weightings for the theoretical and neural net parts, respectively. The extra constant θ_j does not alter the interpretation of the contributions, since it simply adds a constant offset to the final outputs. After the scaling, the percentage contribution of the theoretical part to the concentration estimate for component j can be obtained according to

$$\%\ \mathrm{theor\ contrib}_j = \frac{\bar{t}_j}{\bar{t}_j + \bar{n}_j} \times 100$$
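Pulling eqs 6 and 7 together, the sketch below evaluates the theoretical part and a small sigmoid neural net part and combines them with the output weights t_j and n_j. The layer layout follows the description of Figure 4 given above, and every numerical value is a placeholder for illustration; this is not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def theory_part(X, R, B, W, W0, m):
    """Eq 6: T_j = (sum_i W_ji * (R_i * X_i)**(-B_i) + W_j0)**(1/m_j)."""
    h = (R * X) ** (-B)              # first hidden layer, g_i(h_i) = h_i**(-B_i)
    H = W @ h + W0                   # second hidden layer net input
    return H ** (1.0 / m)            # g_j(H_j) = H_j**(1/m_j)

def neural_part(X, W_hid, b_hid, B_out, b_out):
    """Standard sigmoid part that models what the sensor theory does not capture."""
    return sigmoid(B_out @ sigmoid(W_hid @ X + b_hid) + b_out)

def chemnet_output(X, theory_params, net_params, t, n):
    """Eq 7: O_j = t_j * T_j + n_j * N_j."""
    return t * theory_part(X, *theory_params) + n * neural_part(X, *net_params)

rng = np.random.default_rng(3)
X = rng.uniform(0.2, 0.9, size=7)                      # seven scaled sensor responses
theory_params = (rng.uniform(0.5, 1.5, 7),             # R_i
                 rng.uniform(0.5, 1.5, 7),             # B_i
                 rng.uniform(0.1, 0.5, size=(2, 7)),   # W_ji
                 np.array([0.1, 0.1]),                 # W_j0
                 np.array([0.6, 0.7]))                 # m_j
net_params = (rng.normal(size=(2, 7)), rng.normal(size=2),
              rng.normal(size=(2, 2)), rng.normal(size=2))
print(chemnet_output(X, theory_params, net_params,
                     t=np.array([0.9, 0.9]), n=np.array([0.1, 0.1])))
```

In training, both the neural-net-part weights and the theoretical-part parameters above would be adjusted by gradient descent, as outlined in the next subsection.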

Gradient Descent Equations. The weights in the neural net part can be obtained in the same way as for standard NNs based on BP training. The weights and functional parameters m_j, W_ji, B_i, and R_i in the theoretical part can also be obtained through the gradient descent method used in a BP algorithm. However, the gradient learning equations are much more complicated, as shown in the following equations. Updating is carried out by minimizing

$$E = \frac{1}{2} \sum_{j} [d_j - O_j]^2$$

where d_j is the desired value of the jth component concentration. The derivatives with respect to the theoretical-part parameters follow from the chain rule through the layers of the theoretical part; for example,

$$\frac{\partial E}{\partial R_i} = \sum_{j} \frac{\partial E}{\partial O_j}\, \frac{\partial O_j}{\partial T_j}\, \frac{\partial T_j}{\partial H_j}\, \frac{\partial H_j}{\partial h_i}\, \frac{\partial h_i}{\partial R_i}$$

and the resulting expressions involve terms such as $-[d_j - O_j]\, W_{ji}\, H_j^{m_j} \ln(H_j)$.

MODEL VALIDATION

In order to compare the performance of the various methods, the fitting and prediction errors are calculated. The fitting errors are calculated as the root-mean-square error of fitting (RMSEF),

$$\mathrm{RMSEF} = \sqrt{\frac{\sum_{l=1}^{N_c} (y_{cl} - \hat{y}_{cl})^2}{N_c}}$$

where N_c denotes the number of calibration (training) samples, y_cl denotes the true measured reference value (a concentration, for example), and ŷ_cl denotes the value estimated by the modeling method. The purpose of calibration in analytical chemistry is the prediction of properties in unknown samples. When both a calibration set and an additional prediction set are available, it is possible to use the root-mean-square error of prediction (RMSEP) to test the model according to

$$\mathrm{RMSEP} = \sqrt{\frac{\sum_{l=1}^{N_p} (y_{pl} - \hat{y}_{pl})^2}{N_p}}$$

where N_p denotes the number of prediction (test) samples, y_pl denotes the true reference value, and ŷ_pl denotes the value predicted by the modeling method.

All values are converted to a percent relative fitting or prediction error. For example, the percent relative RMSEF is obtained according to

$$\%\mathrm{RMSEF} = \mathrm{RMSEF} \times (100/\bar{y}_c)$$

where $\bar{y}_c$ is the mean reference value in the calibration set of samples. A similar expression is used for the percent relative RMSEP, where $\bar{y}_p$ indicates the mean reference value of the prediction set of samples.

EXPERIMENTAL SECTION

Data are based on measurements of seven Taguchi gas sensors (TGSs) for two-component mixtures of toluene and benzene. The TGS sensor array by Figaro Inc. consisted of two TGS 823s, two TGS 816s, a TGS 824, a TGS 815, and a TGS 825. The array of seven sensors was arranged in a linear fashion in a flow cell through which air was passed at a constant flow rate of 200 mL/min. Organic solvent vapors of toluene and benzene were generated using bubblers at a temperature of -15 °C for toluene and +5 °C for benzene. The flow of the air to the sensor block and bubblers was controlled by mass-flow controllers (Tylan Inc.) linked to a data acquisition and control computer. Solvent vapor concentrations in the range of 5-500 ppm could be generated. For each sample, the vapors were generated for 10 min to allow adequate equilibration time with the sensors, and between samples the flow cell was purged with air for 45 min.

Table 1. Results from ChemNets and Neural Networks

                        benzene                                 toluene
model          RMSEF (%)  RMSEP (%)  ChemNet (%)     RMSEF (%)  RMSEP (%)  ChemNet (%)     parameters   epochs
ChemNet          20.44      23.19       100            18.40      20.62       100              15          821
ChemNet(1)       14.77      14.12        98            16.69      15.18        99              27         1513
ChemNet(2)a      13.33      11.54        97            14.36      14.34        96              36         1795
ChemNet(3)       11.95      11.02        86            10.19      17.09        33              45         2750
ChemNet(4)       14.42      16.89        96            14.15      12.68        97              52         3467
ChemNet(5)       10.48      14.87        97            11.88      16.43        69              63        16394
NN(6)            10.65      11.35                                                              55         4752
NN(8)                                                  14.10      14.56                        73        21563

a Optimum ChemNet model chosen.

In order to compare the performance of the various methods, the data were divided into a calibration set and a prediction set, each containing 50 samples. Since the ChemNet includes a standard neural network part, the concentration values for both benzene and toluene were scaled to be between 0 and 1; this is necessary because the built-in sigmoids restrict the output values from the neural network to between 0 and 1. The sensor responses are between 0 and 1, so no preprocessing was needed for the input variables.

RESULTS AND DISCUSSION

The relative RMSEF and RMSEP results from a neural network and the ChemNet are provided in Table 1. As stated before, BP algorithms are iterative procedures concentrating on fitting without restrictions. It should be noted here that, since the optimization by BP is highly dependent on the initial random weights, the results reported for the networks in the table are in fact the best among many trials with different initial random weights. The number of weight parameters is provided in the table to allow comparison of the complexities of the network structures. In addition, the table also provides the number of epochs (passes through the training set) that the networks took to reach the minimum RMSEFs.

The relative RMSEP results for the concentrations of benzene and toluene from the standard neural network are 11.35% and 14.56%, respectively. Two separate NNs were used to build the models to predict the concentrations of benzene and toluene. The result for benzene was obtained with six hidden nodes; therefore, a total of 55 weight parameters were used in the structure. The result for toluene was obtained with eight hidden nodes and included a total of 73 weights.

The ChemNet was first implemented without the standard neural net part. The results were not satisfactory, which suggested that the theory alone was insufficient. Therefore, hidden neurons for the neural net part were added one at a time, and ChemNets with up to five hidden neurons were tested. The numbers of weights for the ChemNets with zero to five hidden neurons are 15, 27, 36, 45, 52, and 63, respectively. Table 1 shows that the ChemNet with one hidden neuron (denoted ChemNet(1) in the following) improved both the relative RMSEF and RMSEP significantly over the ChemNet with no hidden neurons. However, the results are still not as good as those for the standard BP NNs. ChemNet(2) produced results much improved over ChemNet(1). As a matter of fact, ChemNet(2) is chosen as the optimal ChemNet model because the ChemNets with 1, 3, 4, and 5 hidden neurons did not produce results as good for both benzene and toluene. The best relative RMSEP result for benzene, 11.02%, is

obtained with ChemNet(3), but it is not significantly different from the 11.54% obtained with ChemNet(2). The best relative RMSEP result for toluene, 12.68%, is obtained with ChemNet(4); this is slightly better than the 14.34% obtained with ChemNet(2). The most important reason that ChemNet(2) was chosen over ChemNet(4) for the estimation of toluene is that fewer parameters were used (i.e., 36 compared with 52). According to both the parsimony principle and the minimal-NN theory, with additional samples the fewer parameters of ChemNet(2) are favored, in terms of predictive ability, over ChemNet(4), which has 52 weights. Predictive ability is the performance on a test data set not used in the training of the networks.

ChemNet(2) produces results comparable to those from the standard NNs. The RMSEP for benzene, 11.54%, is slightly worse than the 11.35% obtained with the NN, while the RMSEP for toluene, 14.34%, is slightly better than the 14.56% obtained with the NN. Most importantly, however, ChemNet(2) is more parsimonious than the NNs. ChemNet(2) used only 36 parameters (weights) to build the models for both benzene and toluene, whereas the NNs used 55 and 73 parameters for benzene and toluene, respectively. Again, according to the parsimony principle, ChemNet(2) is more favorable. The training of ChemNet(2), with only 36 parameters, is also much faster than that of the NNs with 55 or 73 parameters, as evidenced by the corresponding epochs. The training of the NNs took more than twice as many epochs (4752) as ChemNet(2) (1795) for modeling benzene and increased the calculation by a factor of 12 (21 563) compared with ChemNet(2) (1795) for modeling toluene.

Another interesting observation is the contribution of the theoretical part of the ChemNets to the final results. As can be seen, the percentage contributions from the theoretical part are very high in most cases (above 95%) compared with those from the neural net part. In two cases (ChemNet(3) and ChemNet(5)), the percentage contributions are only 33% and 69%, respectively, for modeling the concentration of toluene. Notice that in both cases severe overfitting occurred compared with the other cases, resulting in prediction errors (%RMSEP) that are much higher than the fitting errors (%RMSEF). Normally, whenever a model has been overfit, it loses its predictive ability. Therefore, ChemNets with high contributions from the theoretical part can overcome overfitting problems.

It should also be noted that ChemNet(4) did not overfit when modeling the concentration of toluene and showed very good predictive ability. This might be due to two reasons. First, as stated before, the table includes only the best results among many trials with different initial random weights. It was observed that the training of ChemNets with fewer than three neurons in the neural

net part was stable and reproducible. The training of ChemNet(4), on the other hand, was heavily dependent on the initial weights; in the worst cases, the percentage contributions from the theoretical part were just as poor as those shown here for ChemNet(3) and ChemNet(5). Second, closely examining the weights of ChemNet(4) in the theoretical part reveals that they are close to those of ChemNet(2), which is why in this case the theoretical part has a high contribution to the model. Since the optimization of a ChemNet is started with random initial weights, it is reasonable to believe that the theoretical part of the ChemNet(4) model happened to converge close to that of ChemNet(2) in modeling the concentration of toluene.

In this work, Dr. Einstein's statement holds true in that the sensor theory alone cannot be successful. This is due to the imperfection of the theory, which operates under several simplified assumptions, and to the assumption that the parameter m_ij must be gas specific in order to build an inverse ChemNet model. Because of these uncertainties, the contribution from the neural net part, no matter how small, plays an important role in the success of the ChemNet in this application.


CONCLUSIONS

The concept of ChemNets was introduced in order to deal with a mathematical modeling dilemma. The essence of ChemNets is to incorporate chemical theories into NNs to achieve a model one step closer to the true underlying model. Therefore, good predictive ability should be expected from ChemNets. The ChemNets constructed on the basis of Taguchi sensor theory showed a significant advantage in terms of parsimony; however, a ChemNet without the neural net part was not successful, regardless of the size of the theoretical contribution. This is due to theoretical uncertainties and real-life measurement uncertainties.

There are many other aspects of ChemNets that need to be studied in the future, e.g., the stability of ChemNet models in the presence of noise, the effects of beginning with different ChemNet initializations, and the consequences of ending training prematurely due to the presence of local minima. Since ChemNets include chemical theory in the structure, better interpolation/extrapolation abilities should be expected. ChemNets provide a neural network paradigm that allows users to incorporate a priori information. It is hoped that the introduction of ChemNets will open a new field of multivariate calibration methods in chemometrics. Since the method shows great promise and incorporates chemical theory, it is anticipated that more applications will be discovered in the near future. One potential area that might benefit immediately from the ChemNet concept is the modeling of microelectrical chemical sensors (e.g., ChemFETs).

ACKNOWLEDGMENT

The authors acknowledge Dr. Patrick Carey of the Department of Electrical Engineering at the University of Washington for providing the Taguchi data set. Dr. Mary Beth Seasholtz at Dow Chemical, Inc., is acknowledged for her help and encouragement in this work. Paul Mobley of the Department of Chemistry, University of Washington, is also acknowledged for his comments and help. This research was supported by the Center for Process Analytical Chemistry (CPAC), a National Science Foundation/Industry/University Cooperative Research Center at the University of Washington.

APPENDIX 1

From eq 5,

$$(R_i X_i)^{-B_i} = 1 + \sum_{j=1}^{J} A_{ij}\, Y_j^{\,m_j} \qquad (1A)$$

Rearranging eq 1A,

$$\sum_{j=1}^{J} A_{ij}\, Y_j^{\,m_j} = (R_i X_i)^{-B_i} - 1 \qquad (2A)$$

Equation 2A can be expressed as follows in matrix form,

$$\mathbf{A}\,\mathbf{y} = \mathbf{z}, \qquad y_j = Y_j^{\,m_j}, \quad z_i = (R_i X_i)^{-B_i} - 1 \qquad (3A)$$

If the sensors are different and the number of sensors I is greater than the number of gases J, then the rank of the A matrix should be full, namely, rank(A) = J. Therefore, eq 3A can be solved by least squares,

$$\mathbf{y} = \mathbf{W}\,\mathbf{z} \qquad (4A)$$

where W = (A^T A)^{-1} A^T and the dimensions of W are J × I. Therefore, from eq 4A,

$$Y_j = \left(\sum_{i=1}^{I} W_{ji}\,(R_i X_i)^{-B_i} + W_{j0}\right)^{1/m_j} \qquad (5A)$$

where

$$W_{j0} = -\sum_{i=1}^{I} W_{ji}$$

Therefore, eqs 5A and 1A are equivalent as long as the rank of A is full. Assuming that the sensor responses are different and the number of gases is less than the number of sensors, the full rank of A is generally true.

Received for review October 10, 1994. Accepted February 7, 1995.

AC940996U

Abstract published in Advance ACS Abstracts, March 15, 1995.