Anal. Chem. 1998, 70, 1297-1306
Temperature-Constrained Cascade Correlation Networks

Peter de B. Harrington
Center for Intelligent Chemical Instrumentation, Clippinger Laboratories, Department of Chemistry, Ohio University, Athens, Ohio 45701-2979
A novel neural network has been devised that combines the advantages of cascade correlation and computational temperature constraints. The combination of advantages yields a nonlinear calibration method that is easier to use, stable, and faster than back-propagation networks. Cascade correlation networks adjust only a single unit at a time, so they train very rapidly when compared to backpropagation networks. Cascade correlation networks determine their topology during training. In addition, the hidden units are not readjusted once they have been trained, so these networks are capable of incremental learning and caching. With the cascade architecture, temperature may be optimized for each hidden unit. Computational temperature is a parameter that controls the fuzziness of a hidden unit’s output. The magnitude of the change in covariance with respect to temperature is maximized. This criterion avoids local minima, forces the hidden units to model larger variances in the data, and generates hidden units that furnish fuzzy logic. As a result, models built using temperature-constrained cascade correlation networks are better at interpolation or generalization of the design points. These properties are demonstrated for exemplary linear interpolations, a nonlinear interpolation, and chemical data sets for which the numbers of chlorine atoms in polychlorinated biphenyl molecules are predicted from mass spectra.
Artificial neural networks (ANNs) are powerful pattern recognition tools. Several commercial software packages are designed to provide a suite of ANN methods, such as the MATLAB Neural Network Toolbox (Natick, MA) and NeuralWare, Inc. (Pittsburgh, PA). In addition, texts are available that supply source code for ANN toolboxes. Two examples that furnish source code in C++ are those by Masters1 and Rao and Rao.2 The most common neural network used by chemists has been the back-propagation neural network (BNN), which has been reviewed in the chemical literature.3-5 BNNs have been applied to the identification and characterization of spectra such as UV,6 IR,7 and IR-MS.8 They (1) Masters, T. Practical Neural Network Recipes in C++; Academic Press: Boston, MA, 1993. (2) Rao, V. B.; Rao, H. V. C++ Neural Networks and Fuzzy Logic; Management Information Source, Inc.: New York, 1993. (3) Jansson, P. A. Anal. Chem. 1991, 63, 357A-362A. (4) Zupan, J.; Gasteiger, J. Anal. Chim. Acta 1991, 248, 1-30. (5) Wythoff, B. J. Chemom. Intell. Lab. Syst. 1993, 18, 115-155.
© 1998 American Chemical Society
have been used for distinguishing among manufacturers of the same pharmaceutical by classification of HPLC profiles9 and for predicting the onset of diabetes mellitus.10 Besides identification and classification, BNNs have been used for quantitative modeling of nonlinear kinetics,11 fiber-optic fluorescence data,12 stripping analysis of heavy metals,13 and toxicity of benzothiazolium salts.14 BNNs have been used for correcting MS drift.15 Many varieties of the BNN algorithm exist, most of which differ in training algorithm to overcome the long training times that are typically encountered. Neural networks and BNNs in particular have found only relatively modest application in chemistry. Perhaps this trend has been due to unpredictable models that arise from overfitting the training set and the difficulties in training BNN networks. Cascade correlation networks (CCNs) offer several advantages over BNNs.16 The CCN constructs its own topology (i.e., number of layers and hidden units) during training. The cascade architecture permits the network to model high orders of nonlinearity with fewer units than a two-layer BNN. The most significant advantage of the CCN over the BNN is the significantly faster training rate. Training both the BNN and CCN is often accomplished by gradient descent methods. One problem that causes slow training with gradient methods has been termed “reverse priorities” (i.e., small adjustments for flat spots on the response surface and large adjustments when the response surface is changing abruptly). Although many algorithms exist to accelerate neural network training, the Quickprop algorithm is used for this work.17 A more detrimental cause of slow training for the BNN is the simultaneous adjustment of all the weights in the network. This (6) Mittermayr, C. R.; Drouen, A. C. J. H.; Otto, M.; Grasserbauer, M. Anal. Chim. Acta 1994, 294, 227-242. (7) Klawun, C.; Wilkens, C. L. J. Chem. Inf. Comput. Sci. 1996, 36, 69-81. (8) Klawun, C.; Wilkens, C. L. J. Chem. Inf. Comput. Sci. 1996, 36, 249-257. (9) Welsh, W. J.; Lin, W.; Tersingni, S. H.; Collantes, E.; Duta, R.; Carey, M. S.; Zielinski, W. L.; Brower, J.; Spencer, J. A.; Layloff, T. P. Anal. Chem. 1996, 68, 3473-3482. (10) Shanker, M. S. J. Chem. Inf. Comput. Sci. 1996, 36, 35-41. (11) . Ventura, S.; Silva, M.; Pe´rez-Bendito, D.; Herva´s, C. J. Chem. Inf. Comput. Sci. 1997, 37, 287-291. (12) Sutter, J. M.; Jurs, P. C. Anal. Chem. 1997, 69, 856-862. (13) Chan, H.; Butler, A.; Falck, D. M.; Freund, M. S. Anal. Chem. 1997, 69, 2373-2378. (14) Hatvrı´k, S.; Zahradnı´k; P. J. Chem. Inf. Comput. Sci. 1996, 36, 992-995. (15) Goodacre, R.; Kell, D. B. Anal. Chem. 1996, 68, 271-280. (16) Fahlman, S. E.; Lebiere, C. The Cascade Correlation Architecture; Report CMU-CS-90-100; Carnegie Mellon University: Pittsburgh, PA, Aug 1991, pp 1-13. (17) Fahlman, S. E. An Empirical Study of Learning Speed in Back-Propagation Networks; Report CMU-CS-88-162; Carnegie Mellon University; Sep 1988, pp 1-17.
problem has been referred to as the “moving target problem”.16 The error that is propagated backward to the hidden units also depends on the output units, which are changing during training. No communication occurs among units within a hidden layer. Although the weight vectors are initially pointing in random directions, they will quickly become correlated by orienting themselves in the direction that reduces the largest error. Consequently, the weight vectors of units in the same layer tend to move in the same direction. Algorithms that accelerate training rate frequently will accentuate this effect, which has been termed the “herd effect”.16 The direction of the largest error will change, so that a new minimum will dominate, and the weight vectors that were previously positioned in the error response surface minimum may be driven toward the new dominant error. Consequently, training a BNN is a highly chaotic procedure, and it is remarkable that convergence is achieved at all. For nonlinear calibration, BNN models may not converge. The CCN eliminates this training chaos by adjusting only a single processing unit at a time. The CCN still does not solve the problem of overfitting. One method that reduces the effects of overfitting is the use of temperature constraints. An alternative approach that uses a decay parameter has been used for classifying ion mobility spectra.18 The term “computational temperature” refers to the analogous temperature that is used in simulated annealing and Boltzmann learning machines. If one views the hidden unit as a perceptron, then the temperature parameter adds a width to the classification hyperplane. In addition, the output values of the hidden unit will have a continuous range of values, and the hidden unit furnishes a fuzzy logic.19 A temperature-constrained backpropagation neural network (TC-BNN) has been devised.20 A global temperature (i.e., same temperature for all the units in the network) was slowly lowered during training of the network. This network provided some stability with respect to overfitting, but it required a global temperature parameter for the entire network. Furthermore, temperature constraints may improve training efficiency for cases where the network converges slowly or not at all. In general, constraining the network by a temperature parameter results in slower training. A temperature-constrained cascade correlation network (TCCCN) has been devised. The goal was to develop a neural network system that would have improved performance for nonlinear calibration problems. The coupling of CCNs, which train rapidly, offsets the slower training times introduced by the temperature constraints. An added benefit that arises from training a single unit is that the temperature parameter can be locally optimized for the individual hidden units. The hidden units may be considered optimal fuzzy feature selectors. This learning algorithm combines the advantages of rapid speed and self-configuring topology of the cascade correlation network with the soft modeling capabilities of temperature constraints. The advantage of this TC-CCN over the TC-BNN is that each hidden unit is optimized with respect to computational temperature. (18) Zheng, P.; Harrington, P. B.; Davis, D. M. Chemom. Intell. Lab. Syst. 1996, 23, 121-132. (19) Harrington, P. B. J. Chemom. 1991, 5, 467-486. (20) Harrington, P. B. Anal. Chem. 1994, 66, 802-807.
THEORY The TC-CCN does not connect the inputs directly to the output units. Instead, all the inputs to the output unit must be passed through the fuzzy hidden units, which helps prevent overfitting. Figure 1 is a schematic of the TC-CCN architecture that gives a five-step procedure for constructing these networks. The TC-CCN uses linear output units that are adjusted by a singular value decomposition regression algorithm. The residual error is used for training the hidden units. The residual error is defined as
e_i = \hat{y}_i - y_i \qquad (1)
for which ei is the residual error obtained for the ith object from the predicted value ŷi and the target value yi. The number of inputs into the output units will equal the number of hidden units. The hidden units are added sequentially. Each time a hidden unit is added, the output weight vector is recalculated by regression, and new residual errors are generated. The hidden units train by adjusting their weight vectors so that the pooled magnitude of the covariance between a hidden unit's output and the residual error from the output units is maximized. Several hidden units may be trained simultaneously, and the unit with the largest change in covariance with respect to temperature is selected from the candidate pool for the TC-CCN. The candidate units for the CCN are temperature-constrained sigmoid units. Processing comprises a linear and a nonlinear operation. The linear operation is obtained by

net_{ij} = \frac{\sum_{m=1}^{\nu} w_{jm} x_{im}}{|\mathbf{w}_j|} + b_j \qquad (2)
for which ν is the number of input connections to unit j, wjm is a component of the weight vector, and xim is the input activation coming from the mth neuron in the preceding layer for the ith object. For the temperature parameter to be meaningful, the weight vector must be constrained to a constant length. The weight vector is normalized to unit Euclidean length. The vector length of the jth hidden unit weight vector is defined as |wj|. The bias value for the jth hidden unit is designated as bj. The nonlinear operation is given by
f(net_{ij}) = o_{ij} = \left(1 + e^{-net_{ij}/t_j}\right)^{-1} \qquad (3)
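To make eqs 2 and 3 concrete, a minimal sketch of the forward pass of one temperature-constrained hidden unit is given below (the symbols are defined in the text that follows). The sketch is in C++, the language of the paper's software, but it is an illustration only and not the author's code; all names are illustrative.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Linear operation of eq 2: dot product with a weight vector normalized to
// unit Euclidean length, plus the bias b_j.
double net_input(const std::vector<double>& w, const std::vector<double>& x,
                 double bias) {
    double dot = 0.0, norm = 0.0;
    for (std::size_t m = 0; m < w.size(); ++m) {
        dot  += w[m] * x[m];
        norm += w[m] * w[m];
    }
    return dot / std::sqrt(norm) + bias;
}

// Nonlinear operation of eq 3: logistic function whose fuzziness is set by
// the computational temperature t. Large t gives a soft, continuously graded
// output; t -> 0 approaches a hard 0/1 perceptron response.
double tc_sigmoid(double net, double t) {
    return 1.0 / (1.0 + std::exp(-net / t));
}
```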
for which netij is the input to the logistic function of the ith observation and the jth neuron and oij is the corresponding nonlinear output. The weight vector is trained so that it points in the direction that maximizes the covariance of the unit’s output with the residual error. The temperature (tj) is adjusted so that it maximizes the magnitude of the first derivative of the covariance between the output and the residual error with respect to temperature. This objective function is advantageous, because it causes the error surface to remain steep, which facilitates gradient
Figure 1. Five-step procedure for training the temperature-constrained cascade correlation network. The white square indicates the unit that is trained. Output units are linear, and the hidden units are sigmoidal.
training. In addition, outputs are continuously distributed throughout their range when the derivative is maximized, which ensures fuzzy interpolation of the hidden unit. The weight and bias parameters are adjusted so that the magnitude of pooled covariance between a hidden unit output and the residual error from the output units is maximized. The covariance magnitude (|Cj|) of the output from candidate unit j and the residual error from output k is obtained from

|C_j| = \sum_{k=1}^{p} \left| \sum_{i=1}^{n} (o_{ij} - \bar{o}_j)(e_{ik} - \bar{e}_k) \right| \qquad (4)
for which the covariance is calculated with respect to the n observations in the training set. The absolute values of the covariances are added for the p output units. The averages are obtained for the n objects in the training set for the hidden unit output (oj) and error (ej). The denominator of n - 1 is omitted from the calculation, because it is constant through the entire
training procedure. The weight vectors are adjusted by

\Delta w_{jm} = \sum_{k=1}^{p} d_k \sum_{i=1}^{n} \left( x_{im}\,\frac{\partial f(net_{ij})}{\partial net_{ij}} - \overline{x_{im}\,\frac{\partial f(net_{ij})}{\partial net_{ij}}} \right) (e_{ik} - \bar{e}_k) \qquad (5)
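A sketch of the covariance objective (eq 4) and the weight-gradient step (eq 5) is given below, again as an illustration rather than the author's implementation; d_k is the sign of the covariance C_jk, as defined in the text immediately following, and the helper names are hypothetical.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;  // row-major: [observation][index]

// Pooled covariance magnitude of eq 4 between a candidate unit's outputs o[i]
// and the residual errors e[i][k] of the p output units (the 1/(n - 1) factor
// is omitted, as in the paper).
double covariance_magnitude(const std::vector<double>& o, const Mat& e) {
    std::size_t n = o.size(), p = e[0].size();
    double obar = 0.0;
    for (double oi : o) obar += oi;
    obar /= n;
    double total = 0.0;
    for (std::size_t k = 0; k < p; ++k) {
        double ebar = 0.0;
        for (std::size_t i = 0; i < n; ++i) ebar += e[i][k];
        ebar /= n;
        double c = 0.0;
        for (std::size_t i = 0; i < n; ++i)
            c += (o[i] - obar) * (e[i][k] - ebar);
        total += std::fabs(c);
    }
    return total;
}

// Weight-gradient step of eq 5 for component m of candidate unit j.
// fprime[i] holds o_i(1 - o_i)/t (eq 6); d[k] is the sign of the covariance C_jk.
double delta_w(const std::vector<double>& xm, const std::vector<double>& fprime,
               const Mat& e, const std::vector<int>& d) {
    std::size_t n = xm.size(), p = e[0].size();
    double gbar = 0.0;                       // mean of x_im * f'(net_ij) over i
    for (std::size_t i = 0; i < n; ++i) gbar += xm[i] * fprime[i];
    gbar /= n;
    double dw = 0.0;
    for (std::size_t k = 0; k < p; ++k) {
        double ebar = 0.0;
        for (std::size_t i = 0; i < n; ++i) ebar += e[i][k];
        ebar /= n;
        for (std::size_t i = 0; i < n; ++i)
            dw += d[k] * (xm[i] * fprime[i] - gbar) * (e[i][k] - ebar);
    }
    return dw;   // scaled by the learning rate before being applied
}
```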
for which the derivatives for each covariance are summed for the p output units. The variable dk is equal to 1 when Cjk is greater than zero and equal -1 when the covariance is less than zero. This parameter controls the adjustment in the direction of the largest covariance magnitude. The xim term is 1 for the bias adjustment. The first derivative of the temperature-constrained sigmoid function with respect to net (∂f(netij)/∂netij) for the jth unit is
\frac{\partial f(net_{ij})}{\partial net_{ij}} = \frac{o_{ij}(1 - o_{ij})}{t_j} \qquad (6)
for which the input signal netij produces the output oij.
Figure 2. Covariance response surface with respect to temperature and bias for a single-dimensional classification of the values of 1 and 2.
The first derivative of the temperature-constrained logistic function with respect to temperature (∂f(netij)/∂tj) is

\frac{\partial f(net_{ij})}{\partial t_j} = \frac{-net_{ij}\,o_{ij}(1 - o_{ij})}{t_j^{2}} \qquad (7)
The first derivative of the covariance magnitude with respect to temperature is given by
\frac{\partial |C_j|}{\partial t_j} = \frac{-\sum_{k=1}^{p} d_k \sum_{i=1}^{n} \left( net_{ij}\,o_{ij}(1 - o_{ij}) - \overline{net_{ij}\,o_{ij}(1 - o_{ij})} \right)(e_{ik} - \bar{e}_k)}{t_j^{2}} \qquad (8)
Note that this derivative is simply the covariance of the derivatives of the logistic function and the residual error. The temperature is adjusted to maximize the absolute value of ∂|Cj|/∂tj through the use of the second derivative of the covariance with respect to temperature (∂2|Cj|/∂tj2).
\frac{\partial^{2} |C_j|}{\partial t_j^{2}} = \sum_{k=1}^{p} d_k \sum_{i=1}^{n} \left( \frac{\partial^{2} f(net_{ij})}{\partial t_j^{2}} - \overline{\frac{\partial^{2} f(net_{ij})}{\partial t_j^{2}}} \right)(e_{ik} - \bar{e}_k) \qquad (9)
for which ∂²f(netij)/∂tj² is the second derivative of the logistic function with respect to temperature for unit j and observation i.
\frac{\partial^{2} f(net_{ij})}{\partial t_j^{2}} = \frac{-1}{t_j}\left( \frac{2\,net_{ij}(1 - o_{ij})}{t_j} - \frac{net_{ij}}{t_j} + 2 \right) \frac{\partial f(net_{ij})}{\partial t_j} \qquad (10)
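The unit-level derivatives of eqs 6, 7, and 10 reduce to a few lines of code; the sketch below is illustrative only, not the author's implementation. A Quickprop-style step on the temperature would then be formed from ∂|Cj|/∂tj and ∂²|Cj|/∂tj² (eqs 8 and 9).

```cpp
#include <cmath>

// Derivatives of the temperature-constrained logistic unit used to adjust t_j.
// o = f(net) from eq 3; the expressions follow eqs 6, 7, and 10.

// Eq 6: derivative of the output with respect to net.
double df_dnet(double o, double t) { return o * (1.0 - o) / t; }

// Eq 7: derivative of the output with respect to temperature.
double df_dt(double net, double o, double t) {
    return -net * o * (1.0 - o) / (t * t);
}

// Eq 10: second derivative of the output with respect to temperature.
double d2f_dt2(double net, double o, double t) {
    return (-1.0 / t) * (2.0 * net * (1.0 - o) / t - net / t + 2.0)
           * df_dt(net, o, t);
}
```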
The dk term corrects for the sign of the covariance. The magnitude of the first derivative is maximized by controlling the temperature through the second derivative (∂²|C|/∂t²). The response surface for the covariance as a function of bias and temperature is given in Figure 2, for the single-dimensional
problem of locating a classifier between the numeric values of 1 and 2. For this calculation, bj from eq 2 is the bias, and temperature tj is from eq 3. The weight in eq 2 is simply unity, because it is a normalized scalar. The values for the inputs are 1 and 2, and the outputs are 1 and 2. Therefore, the covariance between the errors (i.e., the mean centered target values) and the inputs will be maximized when the bias value is -1.5, and in Figure 2 the maximum covariance was obtained at this value. Temperature is an important parameter. When the temperatures are too high or too low, the response surface is flat, and the change in covariance with respect to the bias is zero. The parameters of the network (i.e., weights and bias) are adjusted with derivatives. When the response surface curvature is maximized, the networks can be rapidly trained. A second advantage of controlling the curvature of the response function is that the bias value of -1.5 is uniquely defined as the maximum of the curve at intermediate temperatures. If the temperature is too low, any bias value between -1 and -2 will give the same covariance. This ambiguity can result in prediction errors and lack of reproducibility when the models must interpolate. The covariance at the optimum bias (-1.5) is plotted as a function of temperature in Figure 3. The temperature is adjusted during network training for the TC-CCN so that the first derivative of the covariance with respect to temperature (∂C/∂t) is minimized, which maintains the response surface at maximum curvature. This curvature and computational temperature is related to the fuzziness and softness of the hidden units.21 The Quickprop algorithm is used to adjust the temperature of each unit.17 If t is adjusted so that it is less than zero, then t is set to its previous value. The condition that the t has become negative is caused by overcorrection of this parameter; by setting the slope adjustment to zero, only a linear step without momentum will be taken on the next training cycle. Steps toward negative adjustments will occur when the Quickprop algorithm makes an overcorrection, so the training algorithm is slowed only for such temperature modifications. To enhance training speed when the first and second derivatives of the covariance are less than zero, the temperature is (21) Harrington, P. B. Chemom. Intell. Lab. Syst. 1993, 19, 143-154.
decreased by 1%. This accelerates training at high temperatures, where the derivatives would be small. The networks can be trained at an extremely fast rate. However, the modeling ability requires that the weights are adjusted, so that the covariance is maximized, before the temperature is adjusted rapidly. Otherwise, the units are adjusted at too low a temperature, the optimization can be trapped at a local minimum, and no benefits compared to an unconstrained network are achieved.

Figure 3. Plot of covariance (solid line) and the first derivative with respect to temperature (dashed line) at the bias value of -1.5.

The architecture of the cascade correlation network has been modified, so that the input units do not feed into the output unit. In addition, the output unit is linear, and the hidden units are sigmoidal. Figure 1 gives the steps that are used for training the modified architecture. Only a single output unit is used for the work presented in this paper. However, the algorithm works in a similar manner for multiple output units. As a reminder, the output units are adjusted using linear regression. In step 1, the output unit is trained, but only the bias value is adjusted. For this step, the bias value will equal the negative value of the average of the target values for the output. A hidden sigmoidal unit is added that has access to the input values. The weight vector is adjusted so that the magnitude of the covariance (eq 4) is maximized between the residual output error and the output of the hidden unit. The weight vector will be constrained to unit vector length. The temperature is adjusted, so that ∂|C|/∂t is maximized (eq 10). Once the hidden unit is adjusted, it will be held constant. Once the candidate units are trained, the one that gives the largest covariance is selected. The output units are trained by regression of the target values for property k onto the column space of the hidden unit output, O, as given,

w_k = (O^{T} O)^{-1} O^{T} y_k \qquad (11)

for which w_k is the weight vector for output unit k, O is the matrix of outputs for the hidden units, and y_k is a vector of target outputs for property k. The matrix O has n rows that correspond to each observation in the training set and has m + 1 columns (i.e., one for each hidden unit output and a column of values of unity). Augmenting the matrix with a column of ones allows a bias value to be calculated for the output unit weight vectors. The regression step is implemented through singular value decomposition.22

w_k = V S^{-1} U^{T} y_k \qquad (12)
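For illustration, a sketch of the output-layer solve of eq 11 follows (the terms of eq 12 are defined in the next paragraph). The paper performs the inversion through SVD with a singular-value cutoff; the plain normal-equation solve below is a simplification used only to keep the sketch short, and the function name is hypothetical.

```cpp
#include <cstddef>
#include <vector>

using Mat = std::vector<std::vector<double>>;

// Solve the normal equations (O^T O) w = O^T y of eq 11 for one output unit.
// O has n rows and m + 1 columns (hidden-unit outputs plus a final column of
// ones for the bias). Naive Gauss-Jordan elimination without pivoting is used
// here in place of the SVD inversion of eq 12, purely to keep the sketch short.
std::vector<double> output_weights(const Mat& O, const std::vector<double>& y) {
    std::size_t n = O.size(), m = O[0].size();
    Mat A(m, std::vector<double>(m + 1, 0.0));          // augmented [O^T O | O^T y]
    for (std::size_t a = 0; a < m; ++a) {
        for (std::size_t b = 0; b < m; ++b)
            for (std::size_t i = 0; i < n; ++i) A[a][b] += O[i][a] * O[i][b];
        for (std::size_t i = 0; i < n; ++i) A[a][m] += O[i][a] * y[i];
    }
    for (std::size_t c = 0; c < m; ++c) {
        double piv = A[c][c];
        for (std::size_t b = c; b <= m; ++b) A[c][b] /= piv;
        for (std::size_t r = 0; r < m; ++r)
            if (r != c) {
                double f = A[r][c];
                for (std::size_t b = c; b <= m; ++b) A[r][b] -= f * A[c][b];
            }
    }
    std::vector<double> w(m);
    for (std::size_t a = 0; a < m; ++a) w[a] = A[a][m];
    return w;                                           // last element is the bias
}
```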
for which V and U are eigenvectors that respectively span the row and column spaces. The decomposition is implemented so that only singular values (S) and corresponding eigenvectors greater than 1 part per thousand of the singular value sum are used for the inversion. Once the output units are trained, a new residual error is calculated. In step 4 of Figure 1, a new hidden unit is added, and the output from the previously trained hidden unit is used as an input, along with the other data inputs. The new hidden unit is adjusted. In step 5, the output unit is readjusted by regression with the two hidden units’ output and bias value as inputs. This procedure continues until a specified residual error is achieved. The cascade networks differ from other ANNs in that their residual error decreases by discrete steps. Each time a hidden unit is added and trained, the output unit weight vectors are recalculated by regression of the target values onto the column space defined by the hidden unit outputs. Therefore, unlike other neural networks, CCNs do not train to a precisely defined residual error. Standard errors are reported as root-mean-square prediction or calibration error. The expression is given by
\mathrm{RMSSEC} = \mathrm{RMSSEP} = \sqrt{\frac{\sum_{i=1}^{n}\sum_{k=1}^{p}(y_{ik} - \hat{y}_{ik})^{2}}{np}} \qquad (13)
for which RMSSEC refers to errors of the calibration or training set, and RMSSEP refers to errors that were generated by the external prediction set. The RMSSEC values are not corrected for the loss of degrees of freedom. Determining the correct number of degrees of freedom for neural networks is difficult. The PLS values were not corrected either, so that they would be comparable to the neural network values. This approximation is valid because the number of PLS latent variables (i.e., less than 6) was small with respect to the number of objects used during training (i.e., more than 100). EXPERIMENTAL SECTION All calculations were obtained on a single processor 200-MHz Intel Pentium Pro computer equipped with 64 MB of RAM. The operating system was Microsoft Windows NT 4.0C. The software was written in C++ and compiled with a Watcom 11.0 for 32-bit flat mode (Microsoft Win32S, Win95, WinNT systems). All calculations used single-precision (32 bit) floating point arithmetic. Mass spectra were obtained from the Wiley Registry of Mass Spectral Data, 5th ed. All spectra were electron impact spectra. The reference data were composed of 139 polychlorobiphenyl (PCB) spectra that were retrieved with unique configurational formulas, which included biphenyl. The prediction sets of mass (22) Press: W. H.; Teukolsky, S. A.; Vettering, W. T.; Flannery, B. P. Numerical Recipes in C; Cambridge University Press: Cambridge, UK, 1994; pp 5967.
spectra were acquired by removing all the spectra from the training set with a given number of chlorine atoms. The number of chlorine atoms in each molecule was the property that was modeled. These values ranged from 0 to 10. After preprocessing, each spectrum was composed of 464 points that ranged from 50 to 514 amu. These values were acquired for the entire set of training data. The number of points varied slightly (by 1 or 2) and depended on the training set of data and a mapping routine. The evaluation used a mapping routine that rounded the mass-to-charge axis to the nearest atomic mass unit. When several peaks in the same spectrum rounded to the same atomic mass unit value, they were co-added. The mass-to-charge axis is used as a reference. For the prediction and monitoring sets, if the spectra contained atomic mass unit values that were missing from the reference axis, they were omitted from the calculation. Peaks that were missing in the prediction and monitoring data at the reference positions were assigned intensity values of zero. All the neural network evaluations used the same conditions. The learning rate was set to 0.001. A parameter for Quickprop (µ) was set to 1.0. This value is similar to the momentum term used in back-propagation networks. The networks were trained to a residual relative error of 1% for the synthetic data. For the evaluations with these data, all networks trained five candidate units in parallel. The CCN and TC-CCN selected the candidate unit with the largest covariance magnitude (|C|). For the mass spectral data, the TC-CCN used only a single candidate unit for each hidden unit training cycle. For the PLS calculation, PLS-1 was used. The training times for the PLS and the compressed TC-CCN calculations were less than 1 min. For the uncompressed MS data, the TC-CCN took 2 min to train. Under the same conditions, the CCN trained in less than 1 min but yielded predictions that were less accurate. As a point of comparison, a BNN with 10 sigmoidal hidden units and a single linear output unit also trained in less than 1 min. DISCUSSION OF RESULTS The networks were evaluated with respect to repeatability by training the networks with different initial weight values. Repeatability measures the precision of the model-building process with respect to the prediction set. When a network trains to a low prediction error for only a specific set of starting conditions, the network may be overfitting the prediction set and may perform poorly when used with other external data. It may be argued that the model and not the training is important. However, lack of repeatability in training may be the reason that neural networks are not more widely used. The different initial weight values were specified by different seed values for the random number generator. For comparison between networks, the same seed value was used, so that the CCN and TC-CCN were started with the same initial settings. A demonstration compares identically configured CCN and TC-CCN networks for learning a line from five training points, (1, 20), (2, 30), (3, 40), (4, 50), and (5, 60). This comparison shows the benefits of the temperature constraints on the predictive ability of the network model. For evaluating the data, nine points that are evenly spaced at a 0.1 interval between pairs of training points were used. The total number of prediction points was 45. These results are reported in Table 1. Figure 4 gives the prediction results of the first trial.
Table 1. Results from Linear Interpolation Evaluation

trial     TC-CCN (RMSSEP)    CCN (RMSSEP)
1         0.066              1.92
2         0.066              2.85
3         0.062              2.73
4         0.066              4.19
5         0.062              3.32
av ± SD   0.064 ± 0.002      3.00 ± 0.83
Figure 4. Trial 1 results of linear interpolation for the two networks of a line with a slope of 10 and an intercept of 10: training points (squares) and the prediction sets of the TC-CCN (circles) and CCN (triangles).
Note that the unconstrained network (CCN) classifies the points as the design point values, which may be observed in the step-shape pattern of the predictions. In terms of relative error (i.e., prediction error divided by the mean of the dependent block variables), both networks predict well. The CCN and TC-CCN were identically configured with respect to parameters. Fitting a straight line with a nonlinear neural network demonstrates that unconstrained networks will classify the design points with negligible errors. By adding constraints to the network, softer models that interpolate between the design points may be achieved. The same results would be expected if the training data contained noise. This case demonstrates the TC-CCN's ability to interpolate with respect to the independent variable. A similar evaluation examined pathological training sets. The generalized inverse will become singular when independent variables of the same value have different properties or dependent variables. The calibration model is then forced to predict two different values from the same design point. Usually, this will drive the model to overfitting the data, so that small errors due to noise or computational imprecision of the design point are exploited by the network model. The reverse case arises when different independent variables have the same property. This case does not introduce singularities. This data set was composed of 10 training points, (1, 15), (1, 25), (2, 25), (2, 35), (3, 35), (3, 45), (4, 45), (4, 55), (5, 55), and (5, 65). For this evaluation, both PLS and the TC-CCN worked well. The results are reported as RMSSEP calculated from the median of each pair of dependent variables that corresponded to the same
Table 2. Pathological Training Study

trial     TC-CCN (RMSSEP)    CCN (RMSSEP)
1         0.064              2.86
2         0.064              2.65
3         0.062              2.95
4         0.064              4.07
5         0.062              2.42
av ± SD   0.063 ± 0.001      3.00 ± 0.83
Figure 5. Pathological data, with two dependent variables attributed to the same independent variable: design points (squares) and interpolating predictions of the TC-CCN (circles) and CCN (triangles). These results are from trial 1.

Figure 6. Gaussian function, modeled by the two neural networks, which must interpolate between design points. The design points are the integers, and the prediction points are the midpoints between each pair of integers. The right ordinate gives the residual values obtained by subtracting the network prediction from the value obtained from the Gaussian function. The circles and triangles are the interpolating predictions of the TC-CCN and CCN, respectively. These results are from trial 1.

Table 3. Gaussian Function Interpolation

trial     TC-CCN (RMSSEP)    CCN (RMSSEP)
1         0.37               1.41
2         0.33               1.50
3         0.70               1.02
4         0.54               2.56
5         0.35               1.56
av ± SD   0.46 ± 0.16        1.61 ± 0.57

independent variable. In other words, this evaluation forced the neural networks to interpolate between the dependent variables or conflicting target points. The results are reported in Table 2. The TC-CCN obtained an average RMSSEP of 0.063 ± 0.001. The results for the first trial are given in Figure 5. Note that the unconstrained network (CCN) fits one set of design points, while the TC-CCN attempts to fit the median of each pair of conflicting design points. The TC-CCN converged earlier than the target value of 5%. When the training error does not decrease after training the output layer, the network aborts and readjusts the output layer with the last hidden unit omitted. This example demonstrates the TC-CCN's ability to interpolate with respect to the dependent variables. A nonlinear model of a Gaussian function was evaluated. The network was trained with independent variables that were integers from -30 to 29. The target points were obtained from a Gaussian function that had a standard deviation of 10 and a maximum intensity of 100. For this case, the neural networks were evaluated with points that bisect the integers (e.g., 1.5, 2.5, ...) and compared to the values passed into the Gaussian function. The networks were trained to simulate the Gaussian function. Given a single input, the network should generate a result that approximates the result obtained from the Gaussian function. The RMS error measures the difference between the Gaussian output and the neural network prediction. This experiment demonstrates the ability of the networks to interpolate nonlinear trends in the data. The results of the first trial are given in Figure 6. The Gaussian function is given by the straight line. The residual error (the Gaussian function for the interpolating point is subtracted from the network prediction) is plotted for the two networks and given on
the right ordinate. These results are given in Table 3. In this case, PLS could not model these nonlinear data. Four evaluations compared TC-CCN's performance with that of PLS for interpolation with complex data sets. Each evaluation used a different seed value, so that the initial conditions of the network would vary. The same seed values were maintained for the different prediction sets. PLS was selected as a reference method, because it is a robust and standard method for nonlinear calibration. The networks were evaluated for their ability to interpolate by comparison to the PLS results. For these data, the CCN trained in less than 1 min, but the prediction results were much worse than for either PLS or TC-CCN. This task was accomplished by removal of the PCB isomers with the same chlorine number from the training set of spectra. The networks were trained with the remaining spectra, and the spectra that were removed were used for the prediction set. Monitoring sets are used frequently for configuring nonlinear calibration models. These data were the duplicate spectra obtained from the Wiley MS library and were generally worse in quality. The duplicate spectra were obtained from the same compounds as those used in the training data. When the monitoring set was used, all the spectra with the same number of chlorine atoms as used in the prediction set were also removed from the monitoring set. When the entire monitoring set was used (excluding the spectra with the same number of chlorine atoms as in the prediction set), poor results were obtained. The poor results were due to the different distribution of congeners in the duplicate and reference spectra.
Figure 7. Distribution of PCB congeners in the training data set.

Figure 9. PCA scores of the Modulo 18 compressed PCB spectra that accounted for 85% of the cumulative variation. The compressed spectra were centered about their mean and normalized to unit vector length. The numbers indicate the Cl number of the PCB spectra.
Figure 8. PCA scores of the training mass spectra that accounted for 45% of the cumulative variation. The spectra were centered about their mean and normalized to unit vector length. The numbers indicate the Cl number of the PCB spectra.
Therefore, reduced monitoring sets were composed of PCB spectra with chlorine numbers one greater and one less than the chlorine number of the prediction set. Both PLS and TC-CCN constructed models until the monitoring set error increased. The set of latent variables for PLS or hidden units for TC-CCN that yielded the lowest prediction error was used for the model. Besides evaluating the networks with underdetermined data (i.e., more variables than objects), the mass spectra were compressed so that overdetermined data were generated (i.e., fewer variables than objects). The evaluations were conducted using both compressed and uncompressed mass spectra. Two modes of training the networks were used. The networks were trained to a predefined relative training error (RMSSEC), and a training method that used monitoring sets of external data was used to optimize the models. PLS and the neural networks were trained until the prediction error of the monitoring set increased. The models that yielded the lowest prediction error were used. Monitoring sets are frequently employed for optimizing calibration models. The distribution of the PCB data for the training sets is given in Figure 7. The scores of the normalized MS are given in Figure 8. For the PCA characterization of the data, the data were first normalized and then mean-centered. The data were not mean-centered before the TC-CCN evaluations. In addition, the MS data
were compressed using the modulo method of preprocessing.23,24 This method sums the intensities of peaks with the same remainder when the mass-to-charge ratio is divided by an integer. The PCA score plot for these data is given in Figure 9. Different divisor values from 35 to 38 were evaluated by PCA for compressing the spectra. The divisor of 36 gave the best distribution, as defined by grouping of the spectra of the same chlorine numbers on the first two principal components. In addition, the amount of relative variance that was spanned by the first two principal components was largest. A divisor value of 18 gave the same result as 36. Each spectrum was compressed from 464 to 18 points using this procedure. The batch training feature of the CCN was evaluated for optimizing a pool of candidate units. The unit that furnishes the largest covariance was selected from the pool for the network. The hidden units were trained 10 at a time, and the one with the largest covariance was used in the network. For the uncompressed data, the average training time was 12 min for each network. When only a single unit was trained at a time, the training time decreased to 2 min for each network. For compressed mass spectra, the networks trained in less than 1 min. Batch training did not affect the precision or the accuracy of the predictions. In some cases, the RMSSEP increased with the units obtained from batch training. For the results reported below, only a single hidden unit was trained at a time. For difficult data that pose convergence problems for neural networks, batch training a TC-CCN may be helpful. With temperature constraints, no convergence problems were obtained for these studies, with the exception of the pathological data. The networks were trained and compared to PLS results. The PLS algorithm was implemented as PLS-1.25 The results for the CCN evaluations were omitted from this paper, because the CCN took longer to train and generally yielded worse results. The prediction of the CCN performed poorly for classification, interpolation, or generalization. (23) Crawford, L. R.; Morrison, J. D. Anal. Chem. 1968, 40, 1469. (24) Tandler, P. J.; Butcher, J. A.; Tao, H.; Harrington, P. B. Anal. Chim. Acta 1995, 312, 231-244. (25) Geladi, P.; Kowalski, B. R. Anal. Chim. Acta 1986, 185, 1-17.
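A minimal sketch of the modulo compression described above is given below; the names are illustrative and this is not the original code.

```cpp
#include <cstddef>
#include <vector>

// Modulo compression of a mass spectrum: intensities whose integer m/z values
// share the same remainder after division by `divisor` are co-added. With
// divisor = 18 (or 36, both related to the atomic weight of Cl), the 464-point
// PCB spectra collapse to 18 (or 36) summed channels.
std::vector<double> modulo_compress(const std::vector<int>& mz,
                                    const std::vector<double>& intensity,
                                    int divisor) {
    std::vector<double> compressed(divisor, 0.0);
    for (std::size_t i = 0; i < mz.size(); ++i)
        compressed[mz[i] % divisor] += intensity[i];
    return compressed;
}
```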
Table 4. Compressed Data with 80% Training

Cl no.    PLS     TC1     TC2     TC3     TC4     av ± SD (a)
2         1.57    0.39    0.38    0.31    0.26    0.34 ± 0.06
3         0.18    0.39    0.27    0.66    0.14    0.37 ± 0.22
4         0.76    0.53    0.52    0.53    0.52    0.53 ± 0.01
5         0.35    0.52    0.50    0.51    0.53    0.52 ± 0.01
6         0.43    0.48    0.51    0.46    0.52    0.49 ± 0.03
7         0.63    0.29    0.19    0.37    0.29    0.29 ± 0.07
8         0.98    0.59    0.59    0.59    0.63    0.60 ± 0.02
av ± SD   0.70 ± 0.47   0.46 ± 0.10   0.42 ± 0.15   0.49 ± 0.12   0.41 ± 0.18   0.45 ± 0.12

(a) Results in this column were calculated from TC1-TC4.
Table 5. Uncompressed Data with 80% Training

Cl no.    PLS     TC1     TC2     TC3     TC4     av ± SD (a)
2         2.19    0.45    0.40    0.36    0.42    0.41 ± 0.04
3         1.18    0.26    0.24    0.29    0.28    0.27 ± 0.02
4         0.39    0.53    0.53    0.56    0.62    0.56 ± 0.04
5         0.25    0.17    0.15    0.14    0.17    0.16 ± 0.02
6         0.33    0.68    0.69    0.70    0.73    0.70 ± 0.02
7         1.31    0.30    0.23    0.22    0.23    0.25 ± 0.04
8         2.17    0.19    0.20    0.23    0.22    0.21 ± 0.02
av ± SD   1.12 ± 0.84   0.37 ± 0.19   0.35 ± 0.20   0.36 ± 0.20   0.38 ± 0.22   0.36 ± 0.20

(a) Results in this column were calculated from TC1-TC4.
Table 6. Results for the Compressed Monitoring Set

Cl no.    PLS     TC1     TC2     TC3     TC4     av ± SD (a)
2         0.74    0.53    0.53    0.19    0.26    0.38 ± 0.18
3         0.14    0.20    0.19    0.40    0.33    0.28 ± 0.10
4         0.28    0.53    0.52    0.53    0.52    0.53 ± 0.01
5         0.17    0.52    0.41    0.42    0.44    0.45 ± 0.05
6         0.21    0.48    0.51    0.46    0.52    0.49 ± 0.03
7         0.35    0.49    0.32    0.38    0.29    0.37 ± 0.09
8         0.44    0.59    0.59    0.59    0.63    0.60 ± 0.02
av ± SD   0.33 ± 0.21   0.48 ± 0.13   0.44 ± 0.14   0.43 ± 0.13   0.43 ± 0.14   0.44 ± 0.11

(a) Results in this column were calculated from TC1-TC4.
Table 7. Uncompressed Data with Monitoring Set

Cl no.    PLS     TC1     TC2     TC3     TC4     av ± SD (a)
2         1.17    0.45    0.40    0.36    0.42    0.41 ± 0.04
3         0.80    0.33    0.34    0.35    0.32    0.34 ± 0.01
4         0.19    0.59    0.58    0.62    0.67    0.62 ± 0.04
5         0.11    0.15    0.12    0.10    0.13    0.13 ± 0.02
6         0.20    0.23    0.25    0.33    0.29    0.28 ± 0.04
7         0.70    0.36    0.49    0.45    0.49    0.45 ± 0.06
8         0.21    0.69    0.26    0.25    0.67    0.47 ± 0.25
av ± SD   0.48 ± 0.41   0.40 ± 0.19   0.35 ± 0.16   0.35 ± 0.16   0.43 ± 0.20   0.38 ± 0.16
8 (b)     1.87    0.85    0.84    0.28    0.80    0.69 ± 0.28

(a) Results in this column were calculated from TC1-TC4. (b) Results obtained from a monitoring set of only hexachlorobiphenyl spectra.
This result is due to the maximization of the hidden unit covariance that causes a trained unit's outputs to change abruptly from 0 to 1. The results from the first study are given in Table 4. This study used the compressed spectra, so that the data were overdetermined. Both PLS and the TC-CCN were trained until the relative training error was less than 20%. The relative error is defined as the RMSSEC divided by the standard deviation of
the properties (i.e., number of chlorine atoms). The TC-CCN networks were trained four times while varying the initial random weight vectors. In addition, they were trained with 10 candidate units. The RMSSEP values are reported for the number of chlorine atoms in the prediction set. Evaluations for spectra obtained from monochloro- and nonachlorobiphenyls were omitted, because none of the models performed well when they had
to extrapolate for these prediction sets. The same evaluations were conducted on the expanded data that were composed of 464 variables. Both PLS and the TC-CCN were trained until 80% of the dependent block variance was represented by the models. These results are reported in Table 5. A couple of comparisons can be made with Table 4. Although the compressed data gave better results for the PLS evaluation, better results were obtained for the TC-CCN evaluations on the underdetermined data. The two methods performed comparably. The results that used the monitoring sets are given in Tables 6 and 7 for the compressed and expanded data, respectively. For PLS, compressing the data improved the predictions; however, it did not affect the TC-CCN prediction performance as much. Furthermore, the use of a monitoring set did not improve the TC-CCN models, although it did provide a significant improvement for the PLS predictions. When the monitoring sets were used, PLS performed better than the TC-CCN for the compressed data. For the octachlorobiphenyls, two monitoring sets were evaluated. In the first attempt, there was only a single nonachlorobiphenyl duplicate spectrum, which was omitted from the monitoring set. Therefore, in this case, the monitoring set was composed of only hexachlorobiphenyl spectra, and the results were poor for both the TC-CCN and PLS. By adding the single nonachlorobiphenyl spectrum to the monitoring set, the PLS evaluation showed a large improvement, while the TC-CCN gave modest improvement. For the PCB data, several conclusions may be drawn. Typically, one would like to build models from unprocessed data with minimal prior assumptions or processing. The Modulo preprocessing method worked well, because the problem set is narrowly defined to a class of similar molecules. The choices of 36 and 18 are related to the atomic weight of Cl. For complex problems, compressing the data may be more difficult. PLS appeared to be sensitive to direct analysis of unprocessed data, and compressing the data to an overdetermined form (more objects than variables) improves the predictive accuracy. Compressing the data has a small but deleterious effect on the TC-CCN predictive accuracy. The preprocessing method employed required a priori chemical knowledge of the data, which for many complex problems will not be readily available. The simplest approach would be to train the networks to desired target levels of predictive accuracy. This approach is seldom taken because the unconstrained networks are prone to overfitting the training data. Monitoring sets are frequently employed for PLS and neural networks to help prevent overtraining. The composition of the monitoring set may have a pronounced effect on the predictive accuracy. For the monitoring set data, the same trend is observed. Finally, it was hoped that the lack of precision in predictions across the networks trained at varied initial conditions might provide a measure of prediction accuracy. Regions of the model that were not well specified by the training set would yield ambiguous predictions with different starting conditions, which would result in a loss of prediction precision. This effect was studied for PCB mass spectra, and no correlation was observed between prediction precision and accuracy. Due to the nature of the training and prediction sets, we observed that predictions with
larger errors were precisely obtained among networks trained under different initial conditions. CONCLUSIONS The CCN approach furnishes a rapid algorithm for building nonlinear network models. These networks automatically configure their own network topology, have the advantage of incremental learning, and can model high-order functions with fewer hidden units. This architecture is ideal for optimizing the hidden units with respect to fuzzy feature selection. A criterion was developed that determines an optimal temperature (i.e., degree of fuzziness) for each hidden unit. This criterion maximizes the curvature of the covariance response surface, helps avoid problems with network paralysis, avoids local minima, and allows the network to generate space-filling models between training points. Furthermore, the hidden units will model large variances in the input data, due to the fixed length of the weight vector and the maximized covariance criterion. The TC-CCN also avoids the extra preprocessing step of scaling the input data, so that it does not immediately paralyze the network during training. Therefore, faster training times may be achieved through the avoidance of training multiple candidate units. The advantage of this approach is that networks yield more stable and predictable models that are capable of interpolation and generalization as opposed to classification. In the PCB study, the results were compared to those of PLS, and performance was complementary for the TC-CCN. The TC-CCN performed better at the ranges of the property variables, while PLS tended to perform better near the mean of the property variables. In addition, the PLS predictive performance improved when the data were compressed, while the TC-CCN performance worsened. The choice of preprocessing required a priori chemical knowledge that may not be readily available for many chemical problems. These networks are capable of training with a variety of hidden unit types in the candidate pool. In addition, the output units also may incorporate a nonlinear function. Future work will investigate optimizing the algorithm for speed and study indicators for prediction reliability. ACKNOWLEDGMENT The U.S. Army Edgewood Research Development and Engineering Center is acknowledged for funding this work in part under contract DAAM01-95-C-0042. Chuanhao Wan, Ron Tucceri, Paul Rauch, Eric Reese, Erica Horak, and Chunsheng Cai are thanked for their helpful suggestions and comments. This work has been presented in part at the Adaptive Parallel Computing Symposium-96, Dayton, OH, August 8-9, 1996, at the 22nd Meeting of the Federation of Analytical Chemistry and Spectroscopy Societies, Cincinnati, OH, October 16, 1995, and at the Fifth Scandinavian Chemometrics Conference, Lahti, Finland, August 26, 1997. The work has been published in part at the conference proceedings of the Adaptive Parallel Computing Symposium-96.
Received for review August 8, 1997. Accepted January 29, 1998. AC970851Y