Now we divide by the number of molecules that flow through the beam per unit time to determine the number of counts per molecule, $n_f$. If we take the boundary of the beam to be defined by the $1/e^2$ radius, then the number of molecules per second is given by the molecular flux through the $2w$-wide beam waist, with transit time $\tau_t = 2w/v$. Meanwhile, the background counts per transit time are given by an expression in which $\sigma_b$ is a dimensionless scattering cross section that contains the concentration of the scattering species and the height of the imaged volume, and $P$ is the total laser power. These expressions can be rewritten in terms of the reduced variable $k = \sigma I_0/k_f$, where $k$ is now the normalized excitation rate at the maximum of the Gaussian, and $T = k_d \tau_t = 2k_d w/v$, where $\tau_t$ is the maximum value of the transit time through the Gaussian beam. This gives an expression for $n_f/n_b^{1/2}$ whose numerical integration reveals S/N profiles that are similar to those for the square beam profile, except that the saturation is softer with respect to light intensity as a result of the Gaussian profile. The generalization of these equations to the more complicated cases described by eqs 12 and 14 is straightforward.
RECEIVED for review February 20, 1990. Accepted May 18, 1990. This research was supported by the National Science Foundation (BBS 87-20382), by the National Institutes of Health (GM 24032), and by the Director, Office of Energy Research, Office of Health and Environmental Research, Physical and Technological Research Division of the U.S. Department of Energy under Contract DE-FG03-88ER60706.
Spectroscopic Calibration and Quantitation Using Artificial Neural Networks

James R. Long, Vasilis G. Gregoriou, and Paul J. Gemperline*

Department of Chemistry, East Carolina University, Greenville, North Carolina 27858

This article demonstrates the application of artificial neural networks for nonlinear multivariate calibration using spectroscopic data. Neural networks consisting of three layers of nodes were trained by using the back-propagation learning rule. Sigmoid output functions were used in the hidden layer to facilitate nonlinear fitting. Adjustable network parameters were optimized by using simulated data. The effect of random error in the concentration variables and in the response variables was investigated. The technique was tested by using real data, including an example showing the determination of protein in wheat using near-infrared spectroscopic data and two examples showing the quantitation of the ingredients in pharmaceutical products using ultraviolet-visible spectroscopic data.
* Corresponding author.
INTRODUCTION

In the past few years, the topic of neural computing has generated widespread interest and popularity (1, 2). Neural computing is usually implemented by using artificial neural networks. The popularity of this technique is due in part to the analogy between artificial neural networks and biological neural networks, and numerous desirable properties are attributed to artificial neural networks because of this biological analogy. Artificial neural networks are thought to have the ability to "learn" during a training process in which they are presented with a sequence of stimuli (inputs) and a set of expected responses (outputs). Learning is said to happen when the artificial neural network arrives at a generalized solution for a class of problems. Numerous applications of artificial neural networks have been investigated, including pattern recognition, signal processing, process control, and modeling.
Figure 1. Schematic representation of a node in an artificial neural network.

Figure 2. Plot of the sigmoid transfer function: sigmoid gain set to 1.0 (solid line); sigmoid gain set to 2.0 (dashed line).

Figure 3. Schematic representation of a three-layer artificial neural network.
Examples of pattern recognition applications are image recognition (3, 4) and speech recognition (5-7). Examples of signal-processing applications are the filtering of noisy signals (8) and the analysis of time series (9). In system control and process control applications, input signals consist of multivariate on-line measurements, and the output signals of the neural network are used to control crucial process parameters to optimize product quality or minimize cost (10, 11). Modeling applications generally involve the reproduction of patterns (8). In this paper, we demonstrate how artificial neural networks can be used to model spectra of mixtures to produce quantitative estimates of the concentrations of the components in the mixtures. The results obtained by using artificial neural networks are benchmarked against the results obtained by using principal component regression.
THEORY

The fundamental processing element of an artificial neural network is a node (see Figure 1). Nodes are analogous to neurons in biological neural networks. Each node has a series of weighted inputs, $w_i$, which may be either external signals or the output from other nodes. In our application, the external signals are absorbance values, $A_i$. The inputs to the node are analogous to synapses, and the weights correspond to the strength of the synaptic connection. Inputs having negative weights are analogous to inhibitory synapses, and inputs having positive weights are analogous to excitatory inputs. The sum of the weighted inputs is transformed with a linear or nonlinear transfer function. A popular nonlinear transformation function is the sigmoid function shown in Figure 2 and eq 1 (12).

$f(x) = 1/(1 + e^{-x/\theta})$   (1)

The function has an output in the range from 0 to 1, where $x$ is the weighted sum of the inputs and $\theta$ is the gain.
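As a concrete illustration, the node computation of eq 1 can be written in a few lines of Python (a minimal sketch; the authors' actual implementations were in Turbo Pascal and NeuralWorks, and the function and variable names here are ours):

    import numpy as np

    def node_output(inputs, weights, gain=2.0):
        # Weighted sum of the inputs, then the sigmoid of eq 1.
        # A larger gain (theta) gives a gentler transition from 0 to 1.
        x = np.dot(weights, inputs)
        return 1.0 / (1.0 + np.exp(-x / gain))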
The gain serves to modify the shape of the sigmoid curve. A small value for the gain gives the sigmoid function a very steep transition from 0 to 1, whereas a large value for the gain gives a more gentle slope for the transition from 0 to 1. Other transfer functions have been investigated, including the sine function, the hyperbolic tangent, and simple linear functions.

In this paper, a feed-forward network was constructed by using three layers of nodes: an input layer, a hidden layer, and an output layer, as shown in Figure 3. In this application, the input signals are absorbance values, $A_i$, measured at $I$ wavelengths in a spectrum (log(1/reflectance)$_i$ values were used in the wheat example). There is one input node per variable in a spectrum. The input nodes transfer the weighted input signals to the nodes in the hidden layer. A connection between node $i$ in the input layer and node $j$ in the hidden layer is represented by the weighting factor $w_{ji}$; thus, there is a vector of weights, $w_j$, for each of the $J$ nodes in the hidden layer. These weights are adjusted during the learning process. Each layer also has one bias input, as shown in Figure 3, to accommodate nonzero offsets in the data. The value of the bias input is always set to 1.0. A term is included in the vector of weights to connect the bias to the corresponding layer, and this weight is also automatically adjusted during the training process.

The number of hidden nodes is an adjustable parameter. To a first approximation, the number of hidden nodes determines the complexity of the neural network. Increasing the number of nodes in the hidden layer is roughly analogous to increasing the number of principal components used in a principal component regression. The output of each hidden node is a sigmoid function of the sum of that node's weighted inputs. The gain, $\theta$, in the sigmoid function is also an adjustable parameter.

The outputs from each node in the hidden layer are sent to each node in the output layer. For our calibration applications, only one output node was used in the output layer, having an output equal to the scaled concentration of the component of interest. A separate network was trained for each component in the mixture. The concentration values were scaled to lie in the range from 0.2 to 0.8 by adding an offset and multiplying by a constant. Scaled outputs are necessary to accommodate the bounded range of the sigmoid output function (0-1.0) when it is used for the output node. Scaling is not necessary when a linear function is used for the output node.

During the learning procedure, a series of input patterns (e.g., spectra) with their corresponding expected output values (e.g., scaled concentrations) are presented to the network in an iterative fashion while the weights are adjusted.
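A minimal sketch of the forward pass through such a three-layer network follows (Python; the array shapes and names are our assumptions, with a bias input of 1.0 appended to the input and hidden layers, sigmoid hidden nodes, and a single linear output node):

    import numpy as np

    def sigmoid(x, gain=2.0):
        return 1.0 / (1.0 + np.exp(-x / gain))

    def forward(spectrum, W_hidden, w_out, gain=2.0):
        # I absorbances -> J sigmoid hidden nodes -> 1 linear output node.
        # W_hidden has shape (J, I + 1); w_out has length J + 1; the extra
        # column/element holds the bias weight (bias input fixed at 1.0).
        a = np.append(spectrum, 1.0)        # input layer plus bias input
        h = sigmoid(W_hidden @ a, gain)     # hidden-layer outputs
        h = np.append(h, 1.0)               # hidden layer plus bias input
        return w_out @ h                    # scaled concentration estimate

The scaling of the training concentrations into the 0.2-0.8 range (an offset plus a constant multiplier) would be applied to the expected outputs before training and inverted after prediction.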
The training process is terminated when the desired level of precision is achieved between the expected output and the actual output. In this work, the error in the expected output is back-propagated through the network by using the generalized delta rule to determine the adjustments to the weights (12). When a linear output function is used, the output layer error term is given by

$\delta_{pk} = t_{pk} - o_{pk}$   (2)

where $\delta_{pk}$ is the error term for observation $p$ at output node $k$, $t_{pk}$ is the expected output for observation $p$, and $o_{pk}$ is the actual node output. When the output transfer function is a sigmoid, the output layer error term is given by

$\delta_{pk} = (t_{pk} - o_{pk}) o_{pk} (1 - o_{pk})$   (3)

This equation is similar to eq 2 but has been multiplied by the derivative of the sigmoid function, $o_{pk}(1 - o_{pk})$. The error term at node $j$ of the hidden layer that uses a sigmoid transfer function is the derivative of the sigmoid function multiplied by the sum of the products of the output error terms and the weights in the output layer according to (12)

$\delta_{pj} = o_{pj}(1 - o_{pj}) \sum_{k=1}^{K} \delta_{pk} w_{kj}$   (4)

The error terms from the output and hidden layers are back-propagated through the network by making adjustments to the weights of their respective layers. Weight adjustments, or delta weights, are calculated according to (12)

$\Delta w_{ji}(n) = \eta \delta_{pj} o_{pi} + \alpha \Delta w_{ji}(n-1)$   (5)

where $\Delta w_{ji}$ is the change in the weight between node $j$ in the hidden layer and node $i$ in the input layer. In eq 5, $\eta$ is the learning rate, $\delta_{pj}$ is the error term for observation $p$ at node $j$ of the hidden layer, $o_{pi}$ is the observed output for node $i$ of the input layer for observation $p$, and $\alpha$ is the momentum. The terms $n$ and $n-1$ refer to the present iteration and the previous iteration, respectively. The presentation of the entire set of $p$ training observations is repeated when the number of iterations, $n$, exceeds $p$. An equation similar to eq 5 is used to adjust the weights connecting the hidden layer of nodes to the nodes in the output layer.

Prior to the start of training, all of the weights in the network are set to random values and the learning rate and the momentum are initialized. These two constants are sometimes adjusted to smaller values according to an empirical annealing schedule as learning progresses in order to find a global minimum of the error function. The learning rate determines the rate at which information is encoded into the network. When this constant is set too high, a local minimum may be encountered during the descent down the error surface; if it is set too low, the rate of learning can be too slow. To help resolve this dichotomy, Rumelhart et al. suggested the use of a momentum term that would act to reinforce the general trends in the changes in the weights, filter out high-frequency fluctuations, and increase the speed of lower learning rates (12).

Numerous useful and beneficial properties of neural networks have been claimed. When sigmoid transfer functions are used, linear as well as nonlinear applications can be handled easily. In the case of linear applications, weights are automatically adjusted so that the midsection of the sigmoid response function is used to achieve a good linear approximation. When nonlinear response is present, weights are automatically adjusted so that the curved portions of the sigmoid transfer functions are used. It has also been claimed that an arbitrary nonlinear mapping of input domains to output domains can be achieved by using three layers in artificial neural networks (13).
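A sketch of one back-propagation step implementing eqs 2-5 for a single observation follows (Python; a linear output node and one output per network, as used in this work, with variable names of our choosing):

    import numpy as np

    def train_step(a, t, W_h, w_o, dW_h_prev, dw_o_prev,
                   eta=0.15, alpha=0.0, gain=2.0):
        # a: input spectrum plus bias input (length I + 1); t: expected
        # scaled concentration. Returns the updated weights plus the delta
        # weights needed for the momentum term on the next iteration.
        h = 1.0 / (1.0 + np.exp(-(W_h @ a) / gain))   # sigmoid hidden outputs
        hb = np.append(h, 1.0)                        # hidden outputs plus bias
        o = w_o @ hb                                  # linear output node

        delta_o = t - o                               # eq 2 (linear output)
        # eq 4: sigmoid derivative times the back-propagated output error
        delta_h = h * (1.0 - h) * (delta_o * w_o[:-1])
        # eq 5: delta weights with learning rate eta and momentum alpha
        dw_o = eta * delta_o * hb + alpha * dw_o_prev
        dW_h = eta * np.outer(delta_h, a) + alpha * dW_h_prev
        return W_h + dW_h, w_o + dw_o, dW_h, dw_o

With eta = 0.15, alpha = 0.0, and gain = 2.0, the defaults match the settings listed later in Table I.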
Figure 4. Simulated spectra: component A (solid line); component B (dashed line).
Parsimonious models can be obtained by using artificial neural networks by reducing the number of nodes in the hidden layers to the minimum number that gives acceptable performance. Artificial neural networks have been observed to give stable behavior under subtle perturbations and random noise in the input signals compared to other nonlinear methods. This behavior is attributed to the signal-averaging effect of the summations and the bounded output domain of the sigmoid transfer function. In addition, neural networks have been observed to be fault tolerant due to the automatic incorporation of redundant nodes. Many workers in this field have observed that minimal changes occur in the expected outputs of an artificial neural network when several nodes are pruned from the network. A more in-depth discussion of the theory behind artificial neural networks and the generalized delta rule may be found in Rumelhart et al. (12) and Pao (14). A description of the principal component regression technique used here is given by Mardia et al. (15).
EXPERIMENTAL SECTION

Simulated data were used to learn how adjustable parameters affect the performance of a back-propagation network. The adjustable parameters included the gain, the momentum, the learning rate, the number of hidden nodes, the scaling of input data, linear versus sigmoid output functions, scaled outputs, and the number of learning cycles. The simulated data were generated from 70 random linear combinations of the two simulated spectra shown in Figure 4. Each simulated spectrum contained 50 points. The "concentrations" of the two components were uniformly distributed over the range from 0 to 1. Only component A in Figure 4 was quantitated in this part of the study.

Back-propagation artificial neural networks having three layers were created with a Turbo Pascal software package written in this laboratory and with the NeuralWorks Professional II software package from NeuralWare, Inc. Both software packages converged to give nearly identical results; the differences in the results from the two packages were attributed to the differences in the random starting weights. The specifications for the networks created for the calibration of the simulated data are listed in Table I. The parameters listed in Table I were found to be optimal for fast learning with low prediction errors. The optimization of the parameters was carried out by systematically varying each of the parameters until the best network performance was achieved.

Measurement error in the simulated absorption spectra was generated by adding scaled, uniformly distributed random deviates to the spectra. Assuming the maximum absorbance in the data set was 1.0 absorbance unit, an error of 5% corresponded to uniform random deviates having a mean of 0 scaled to the range from -0.05 to +0.05 absorbance unit. Fifty of the 70 simulated spectra were used for training, and the remaining 20 spectra were used for testing.
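A sketch of this data-generation procedure (Python; the Gaussian band shapes stand in for the two simulated spectra of Figure 4, whose exact shapes are not reproduced here):

    import numpy as np

    rng = np.random.default_rng(0)
    wl = np.linspace(200.0, 250.0, 50)              # 50 points per spectrum
    # Stand-in component spectra (assumed Gaussian bands).
    spec_a = np.exp(-((wl - 220.0) / 8.0) ** 2)
    spec_b = np.exp(-((wl - 232.0) / 8.0) ** 2)

    conc = rng.uniform(0.0, 1.0, size=(70, 2))      # uniform on [0, 1]
    mixtures = conc @ np.vstack([spec_a, spec_b])   # 70 random linear combinations

    # 5% measurement error: uniform deviates on [-0.05, +0.05] absorbance unit.
    mixtures += rng.uniform(-0.05, 0.05, size=mixtures.shape)

    train_X, test_X = mixtures[:50], mixtures[50:]  # 50 training, 20 test spectra
    train_y, test_y = conc[:50, 0], conc[50:, 0]    # quantitate component A only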
Table I. Artificial Neural Network Specifications and Parameters

parameter                          simulated data
input nodes                        variable (see text)
hidden nodes                       variable (see text)
output nodes                       1
learning rate                      0.15
momentum                           0.0
gain                               2.0
input layer transfer function      linear
hidden layer transfer function     sigmoid
output layer transfer function     linear
no. of iterations                  10 000
One hundred near-infrared spectra of wheat for the determination of protein were obtained from the USDA, Beltsville, MD. This data set and a least-squares curve-fitting analysis of it have been previously published (16). Near-infrared reflectance spectra (log 1/R) of samples from 100 lots of hard red spring wheat were acquired from 1000 to 2600 nm in 1.6-nm intervals. In order to reduce the size of the neural networks, every eighth data point from the spectra was retained to give a reduced spectrum having 126 points per spectrum. All variables in the spectra were mean centered and scaled to a variance of 1.0 (a preprocessing sketch is given at the end of this section). The protein content of the wheat samples was determined by using the Kjeldahl procedure.

UV-visible spectra of two different pharmaceutical products were also analyzed. The first product was an experimental injectable product containing one active ingredient and a preservative (benzyl alcohol) in aqueous solution. The identity of the active ingredient cannot be revealed because of the proprietary nature of the product. Two concentration ranges were studied. In the first study, the nominal assay concentrations for the active ingredient and benzyl alcohol were 1.0 and 9 mg/mL, respectively, corresponding to a 10-fold dilution of the product. The UV-visible spectra of the mixtures were obtained by using a Hewlett-Packard 8452A diode array spectrophotometer in the wavelength range from 270 to 290 nm in increments of 2 nm. Eleven calibration standards were used, and five test standards were used. This data set was previously reported and analyzed by this laboratory and is known to contain a subtle nonlinear response, which was presumed to be due to interaction between the high molecular weight active ingredient and benzyl alcohol (17). A second set of spectra was acquired for the same product by using a 50-fold dilution of samples, giving nominal assay concentrations of 0.2 and 1.8 mg/mL for the active ingredient and benzyl alcohol, respectively. Spectra of these samples were acquired from 230 to 270 nm in 2-nm steps. This data set has also been previously analyzed in this laboratory and has been shown to give good linear response (17).

A UV-visible assay for the active ingredients pseudoephedrine hydrochloride and triprolidine hydrochloride and the preservatives sodium benzoate and methylparaben in Actifed Syrup was developed. Twenty-nine calibration standards and nine test standards were prepared in 0.1 M HCl by using the concentrations listed in Table II. Excipient ingredients were added to each of the calibration standards and test mixtures. Excipients included D&C yellow 10, glycerine, purified water, alcohol, and sorbitol. The concentrations of the individual excipient ingredients have not been revealed to protect the proprietary information of Burroughs Wellcome Co. (see Table II). In one set of experiments, calibration standards and unknowns were prepared and the UV-visible spectra were measured from 202 to 310 nm in 2-nm intervals. The nominal assay concentrations were 0.12, 0.025, 0.006, and 0.006 mg/mL of pseudoephedrine hydrochloride, triprolidine hydrochloride, sodium benzoate, and methylparaben, respectively, corresponding to a dilution of 3.0 mL of the product to 500 mL. Evidence of nonlinear response was observed at this concentration range; therefore, a second set of calibration spectra and unknown spectra were acquired over the same wavelength range, corresponding to a dilution of 3.0 mL of the product to 2000 mL. The Hewlett-Packard diode array spectrophotometer described above was used to measure all UV-visible spectra.
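As a sketch of the wheat-spectrum preprocessing described above (Python; the stand-in array and its dimensions are our assumptions — 1000 to 2600 nm at 1.6-nm intervals corresponds to roughly 1001 points, so every eighth point gives 126 variables):

    import numpy as np

    # Stand-in for the 100 log(1/R) wheat spectra (100 x 1001 points assumed).
    wheat_spectra = np.random.default_rng(0).uniform(size=(100, 1001))

    reduced = wheat_spectra[:, ::8]    # every eighth point: 126 variables
    # Mean center each variable and scale it to a variance of 1.0.
    reduced = (reduced - reduced.mean(axis=0)) / reduced.std(axis=0)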
RESULTS AND DISCUSSION

In order to determine an optimal set of conditions for training an artificial neural network with a low error of prediction and the fewest possible number of iterations, networks were trained by using the simulated data under a wide range of parameter settings.
Table II. Concentration of Active Ingredients, Preservatives, and Excipients in Actifed Syrup Standards (Concentrations Expressed as the Percent of the Nominal Assay Concentration)

std. no.   pseudoephedrine HCl   triprolidine HCl   sodium benzoate   methylparaben   excipients
1          120                   80                 80                80              80
2          80                    120                80                80              80
3          80                    80                 120               80              80
4          120                   120                120               80              80
5          80                    80                 80                120             80
6          120                   120                80                120             80
7          120                   80                 120               120             80
8          80                    120                120               120             80
9          80                    80                 80                80              120
10         120                   120                80                80              120
11         120                   80                 120               80              120
12         80                    120                120               80              120
13         120                   80                 80                120             120
14         80                    120                80                120             120
15         80                    80                 120               120             120
16         120                   120                120               120             120
17         140                   100                100               100             100
18         60                    100                100               100             100
19         100                   140                100               100             100
20         100                   60                 100               100             100
21         100                   100                140               100             100
22         100                   100                60                100             100
23         100                   100                100               140             100
24         100                   100                100               60              100
25         100                   100                100               100             140
26         100                   100                100               100             60
27         100                   100                100               100             100
28         100                   100                100               100             100
29         100                   100                100               100             100
A set of parameters was found that works well for many different data sets. This helped to reduce the amount of testing required to train an artificial neural network for new spectral data sets. The number of nodes in the hidden layer was the only parameter that required further adjustment with each data set. A summary of how each parameter affects network performance is given in the sections below.

Learning Rate. A learning rate of 0.15 was found to work well with all the spectroscopic data sets. This value is quite low but helped to maintain network stability during training. If the learning rate was set too high, the network became unstable; the result was either high prediction errors or divergent behavior that caused the output and the weights to grow extremely large. The propensity for a network to exhibit such divergent behavior increased with the size of the network (e.g., the number of hidden nodes). The low learning rate also helped achieve the smallest prediction errors.

Momentum. As various learning rates were being investigated, momentum values were also varied in the hope of finding a combination of the two parameters that would give the most rapid optimization of the network. As the necessity for a low learning rate became apparent, the use of the momentum term became questionable. It was found that no appreciable advantage was obtained by using a momentum term with such a low learning rate; there were no apparent disadvantages of using the momentum term either. For all calibration problems, a momentum term of 0 was used.

Gain. The gain for the sigmoid output functions was set to 2.0. This allowed a more gently varying sigmoid transfer function. Subtle improvements in the network's performance (e.g., lower prediction errors) were observed by using a gain of 2.0. The sharper transition from 0 to 1 in the sigmoid transfer function associated with a gain of 1.0 appeared to introduce a stronger bias at the highest and lowest concentration values in a calibration set. This tendency manifested itself by giving an S-shaped curvature to plots of residuals.
Figure 5. Plot of the calibration curve for the artificial neural network calibration of component 1 in simulated two-component mixtures. A sigmoid transfer function was used for the output node.

Figure 6. Plot of the calibration curve for the artificial neural network calibration of component 1 in simulated two-component mixtures. A linear transfer function was used for the output node.

Figure 7. Plot of the relative standard error of calibration, SEC, and the relative standard error of prediction, SEP, versus the number of nodes in the hidden layer for the simulated data with 1.0% noise in the response variables.
When the sigmoid transfer function was used for the output node, the effect was compounded and the prediction error was higher.

Linear Output Functions. A linear output function was found to be optimal for calibration applications because of its extended dynamic range. There are no lower or upper bounds to the linear transfer function, whereas the sigmoid transfer function has lower and upper bounds of 0 and 1, respectively. As can be seen in Figure 5, the use of the sigmoid output function results in model error (e.g., lack of fit) at high and low concentration values when applied to linear data. The results in Figure 6 were obtained by using a linear output function. The model error was reduced substantially; however, a small lack of fit is still present, presumably because a nonlinear technique is being applied to linear data. The use of the linear output function also allowed for faster learning. Convergence to a minimum error was usually obtained within 10 000 iterations. Identical networks using a sigmoid output function took approximately 5 times as many iterations to minimize the prediction error, and even then the prediction suffered because of model error.

Number of Hidden Nodes. We found the optimum number of nodes required in the hidden layer of an artificial neural network to be specific to each unique data set. The proper number of nodes in the hidden layer for each data set was determined by training artificial neural networks with different numbers of nodes in the hidden layer and then comparing the prediction errors from an independent test set for each network (a sketch of this selection procedure follows below). Figure 7 shows a plot of the standard error of calibration, SEC, and the standard error of prediction, SEP, as a function of the number of nodes in the hidden layer for the simulated data set with 1.0% random noise added to the simulated spectra. The values for SEC and SEP are the root mean squared errors for the calibration set and the test set, respectively.

Figure 8. Plot of the standard error of calibration, SEC (solid line), and the standard error of prediction, SEP (dashed line), versus the number of learning iterations using an artificial neural network for the calibration of triprolidine hydrochloride in Actifed Syrup. Nine nodes with sigmoid transfer functions were used in the hidden layer. A linear transfer function was used for the output node.

For the example shown in Figure 7, a minimum in SEC and SEP occurred when five nodes were used in the hidden layer. For this data set, we believe that networks with fewer than five nodes in the hidden layer do not possess sufficient complexity to model the data precisely, while networks with more than five nodes are unnecessarily complex, thereby propagating too much random noise through the net to the output node.

Continued training beyond 10 000 iterations frequently resulted in only a negligible improvement in the network's prediction performance. In some cases, it was possible to overtrain a network. This condition was manifested by a slight increase in SEP as the learning iterations increased while SEC leveled off or continued to decrease only slightly (see Figure 8 for an example). We believe this behavior is due to overfitting, where the network begins to model random noise specific to the calibration data. This overtraining results in a corresponding loss of generality, leading to greater prediction errors for the test data.

For a few data sets, networks with one node in the hidden layer were found to be optimal. In each of the examples where only one hidden node was required, neural networks were also trained having no hidden layer, with a sigmoid transfer function. For these data sets, the results for the two-layer networks were as good as or slightly worse than the results obtained for the three-layer networks. For example, SEC and SEP for protein in wheat (as is) were 1.2% and 2.2% for the three-layer network compared to 1.1% and 2.0% for the two-layer network.
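The node-count selection described above amounts to a simple search over trained networks. A sketch follows (Python; train_network is a placeholder for the back-propagation procedure outlined in the Theory section, not a routine from the authors' software):

    import numpy as np

    def rms_error(predicted, expected):
        # SEC/SEP: root mean squared error over a calibration or test set.
        return np.sqrt(np.mean((np.asarray(predicted) - np.asarray(expected)) ** 2))

    def select_hidden_nodes(train_network, train_X, train_y, test_X, test_y,
                            candidates=range(1, 11)):
        # Train one network per candidate node count and keep the lowest SEP.
        best = None
        for j in candidates:
            predict = train_network(train_X, train_y, hidden_nodes=j)  # assumed API
            sec = rms_error(predict(train_X), train_y)
            sep = rms_error(predict(test_X), test_y)
            if best is None or sep < best[1]:
                best = (j, sep, sec)
        return best  # (node count, SEP, SEC)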
Table III. Comparison of Neural Network Calibration to Principal Component Regression (SEC and SEP are relative errors; ANN = artificial neural network, PCR = principal component regression)

                                              ANN             ANN             ANN     ANN     PCR        PCR     PCR
data set                                      hidden nodes    iterations      SEC     SEP     comps      SEC     SEP
% random error in response vars (simulated data)
  0.1                                         5               10 000          1.1     0.84    2          0.024   0.020
  0.5                                         5               10 000          1.1     0.89    2          0.12    0.088
  1.0                                         5               10 000          1.1     0.84    2          0.27    0.19
  5.0                                         5               10 000          2.1     1.7     2          1.4     1.0
% random error in conc variables (simulated data)
  1.0                                         5               10 000          1.3     0.98    2          0.30    0.28
  5.0                                         5               10 000          1.8     1.6     2          1.3     1.4
  10.0                                        5               10 000          2.7     3.2     2          2.3     3.3
near-IR spectra of wheat
  protein (as is)                             1               35 000          1.2     2.2     7          1.1     1.9
  protein (dry basis)                         1               40 000          1.3     2.3     7          1.1     1.9
exptl injectable prod. (dilute samples)
  active ingr                                 3               4 000           0.50    0.18    2          0.47    0.35
  benzyl alcohol                              1               2 000           0.16    0.099   2          0.21    0.15
exptl injectable prod. (conc samples)
  active ingr                                 2               10 000          0.16    0.25    2          0.40    0.32
  benzyl alcohol                              1               11 000          0.15    0.31    4          0.18    0.36
Actifed Syrup (dilute samples)
  pseudoephedrine HCl                         6               6 000           0.741   0.721   5          0.92    1.25
  triprolidine HCl                            9               16 000          1.68    2.09    7          1.37    2.36
  sodium benzoate                             10              14 000          0.724   1.14    4          0.77    2.05
  methylparaben                               1               20 000          0.833   0.711   4          0.24    0.78
Actifed Syrup (conc samples)
  pseudoephedrine HCl                         10              10 000          3.12    16.92   8          1.58    7.67
  triprolidine HCl                            1               15 000          2.33    6.51    4          11.66   12.1
  sodium benzoate                             5               4 000           0.631   0.646   7          0.22    0.85
  methylparaben                               3               8 000           0.737   0.328   5          0.56    0.49
For benzyl alcohol in the concentrated injectable samples, SEC and SEP were 0.15% and 0.31% for the three-layer network compared to 0.46% and 0.66% for the two-layer network. This result indicates that only first-order features are important in these examples. The slightly better performance of the three-layer networks can be explained as follows: In the three-layer networks, a sigmoid transfer function was used for the hidden node and a linear transfer function was used for the output node. The weighting factor connecting the hidden node to the output node in the three-layer network was adjustable, thereby allowing the training process to automatically select a short segment (large weight) or a long segment (small weight) of the sigmoid transfer function, whichever happened to best fit the data. The training process is also able to automatically adjust the bias weight to the output layer, thereby selecting the convex, concave, or approximately linear portions of the sigmoid curve, whichever happen to best fit the data.
Results Using Simulated Data. The results for the calibration of the simulated data are summarized in Table III. Calibration and prediction were performed on each data set by using an artificial neural network and principal component regression (PCR). For all of the simulated data sets, five hidden nodes in the artificial neural networks generally gave the best performance. No significant improvement in the artificial neural network results was observed beyond 10 000 iterations of the learning cycle. In nearly all cases, principal component regression outperformed the artificial neural networks for these perfectly linear data sets. Examination of the residuals for all of these perfectly linear data sets showed that model error from fitting sigmoid functions to linear data was responsible for the inferior performance of the artificial neural networks.
Model error was manifested by the appearance of S-shaped curvature in the residual plots. When random error was added to the response variables, nearly constant standard errors of calibration and prediction were obtained from the artificial neural network calibrations. When random error was added to the concentration variables, similar results were obtained. Clearly, model error was the most significant factor affecting the artificial neural network results. At 5% noise, the performance of the two techniques (PCR and artificial neural networks) was comparable, indicating that model error in the artificial neural networks was no longer as significant at this high level of noise.

Results Using Near-Infrared Spectra of Wheat. The results of the calibration of protein in wheat using near-infrared spectra are summarized in Table III. The PCR results reported here compare favorably to the previously published least-squares results using the "wide" wavelength range (16). Table III reveals a slight improvement in the PCR results over the artificial neural networks. An artificial neural network with one hidden node gave the best performance (lowest SEP). No significant improvement in the results was observed after 35 000-40 000 iterations of the learning cycle. A plot of the residuals did not reveal any obvious differences or bias between the two methods. It seems that the nonlinear sigmoid transfer function used in the artificial neural network gives slightly worse performance than does the linear model used by PCR.

Results Using UV-Visible Spectra of Pharmaceutical Products. The results of the calibration of benzyl alcohol and the active ingredient in the experimental pharmaceutical product are summarized in Table III. A slight improvement in the artificial neural network results compared to PCR can be observed for the concentrated standards, where nonlinear response is present. For the dilute standards, artificial neural networks also outperformed PCR. Overall, the artificial neural network gave slightly better results than PCR.

The results for the calibration of the active ingredients and preservatives in Actifed Syrup are shown in Table III. For the concentrated samples, all spectra exhibited absorbances greater than 1.5 absorbance units in the wavelength range from 202 to 210 nm. The diode array spectra in this region were compared to spectra from a conventional double-beam scanning instrument and showed clear evidence of being affected by stray light: the maximum measured absorbances in the diode array spectra were less than the maximum measured absorbances in the spectra from the double-beam scanning instrument. The artificial neural network calibration results are especially interesting because of the presence of these nonlinear instrumental artifacts. Specifically, the artificial neural network results (SEP) for the two preservatives, sodium benzoate and methylparaben, in the concentrated samples are slightly better than the results using PCR. These results are better than the results for the preservatives in the dilute samples because neither sodium benzoate nor methylparaben exhibits strong enough absorption for quantitation at the dilute concentrations in these wavelength ranges.
The principal component regression results for the two preservatives in the concentrated samples are worse than the artificial neural network results because the nonlinear response present in these spectra is inadequately modeled by PCR. Preliminary studies indicate that partial least squares (PLS) gives results that are statistically no different from the PCR results; thus, the same conclusion can be drawn for PLS (i.e., PLS does not adequately model the nonlinear response present in these spectra). For the two active ingredients, pseudoephedrine hydrochloride and triprolidine hydrochloride, the artificial neural network calibration gave the overall lowest standard error of prediction (SEP) for the dilute samples. These results are slightly better than the results obtained by PCR.
CONCLUSIONS

Training artificial neural networks for the purpose of calibrating a multicomponent spectroscopic assay can be a lengthy and tedious task. For some of our largest networks, 5-6 h was required for 40 000 learning iterations on a Zenith 80386 computer operating at 33 MHz. In the near future, inexpensive specialized parallel computing hardware may become available that will be capable of performing the training operations faster by several orders of magnitude. When this type of hardware becomes available, artificial neural networks may become a powerful tool for spectroscopic calibration. In data sets where strictly linear response is observed, PCR or other methods based on linear additive models should be expected to give the best performance. Our studies indicate that when nonlinear response due to solute interactions or nonlinear instrumental response functions is present, artificial neural networks may be capable of giving superior performance for spectroscopic calibration.

ACKNOWLEDGMENT

The authors wish to acknowledge Karl Norris and the USDA Beltsville Agricultural Research Center for providing the near-infrared wheat data set. Credit is given to Burroughs Wellcome Co. for in-kind support of this research in the form of chemicals and instrument time.

LITERATURE CITED

(1) Denker, J. S., Ed. Neural Networks for Computing; American Institute of Physics Conference Proceedings No. 151: Snowbird, UT, 1986.
(2) IEEE Proceedings of the 2nd Annual International Conference on Neural Networks, San Diego, CA, June 1988.
(3) Carpenter, G. A.; Grossberg, S. Appl. Opt. 1987, 26, 4919-4930.
(4) Farhat, N.; Psaltis, D.; Prata, A.; Paek, E. Appl. Opt. 1985, 24, 1469.
(5) Waibel, A.; Hanazawa, T.; Hinton, G.; Shikano, K.; Lang, K. In IEEE Transactions on Acoustics, Speech, and Signal Processing, March 1989; Vol. ASSP-37.
(6) Delgutte, B. J. Acoust. Soc. Am. 1984, 75, 879-886.
(7) Ghitza, O. In Proceedings of the International Conference on Acoustics, Speech, and Signal Processing, Dallas, TX, April 1987; Vol. ICASSP-87.
(8) Klimasauskas, C. C. NeuralWorks User's Guide; NeuralWare, Inc.: Sewickley, PA, 1988; p 453.
(9) Lapedes, A.; Farber, R. Los Alamos National Laboratory Report LA-UR-87-2662, 1987.
(10) Anderson, C. W. In Proceedings of the Fourth International Workshop on Machine Learning, University of California, Irvine, 1987; pp 103-114.
(11) Pao, Y. H. J. Intell. Rob. Syst. 1988, 1, 35-53.
(12) Rumelhart, D. E.; McClelland, J. L.; and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition; MIT Press: Cambridge, MA, 1986; Part 1.
(13) Lippmann, R. P. IEEE ASSP Mag. 1987, 4, 4-22.
(14) Pao, Y. H. Adaptive Pattern Recognition and Neural Networks; Addison-Wesley: Reading, MA, 1989.
(15) Mardia, K. V.; Kent, J. T.; Bibby, J. M. Multivariate Analysis; Academic Press: London, 1979.
(16) Hruschka, W. R.; Norris, K. H. Appl. Spectrosc. 1982, 36, 261-265.
(17) Gemperline, P. J.; Salt, A. J. Chemom. 1989, 3, 343-357.
RECEIVED for review January 10, 1990. Accepted May 17, 1990.