Sigmoid Transfer Functions in Backpropagation Neural Networks

Peter de B. Harrington

Center for Intelligent Chemical Instrumentation, Department of Chemistry, Clippinger Laboratories, Ohio University, Athens, Ohio 45701-2979

Backpropagation neural networks are vigorously being applied to a broad range of problems in chemical analysis. Recently, investigators claimed to have developed a novel bipolar sigmoid transfer function.1 The same function has also been reported to behave unstably about the inflection point and to require training rates much lower than the sigmoid function.2 The use of bipolar transfer functions was first reported to increase training rates by an average of 30-50% in 1987.3 The hyperbolic tangent was one of the earliest bipolar transfer functions.4 Bipolar and conventional sigmoid functions are equivalent in that a network composed of bipolar sigmoid functions does not possess any new modeling capabilities. A bipolar sigmoid function is an affine transformation of a conventional sigmoid function. A linear transformation is performed with the weights of processing elements in subsequent network layers, so that bipolar and conventional sigmoid functions are equivalent. Training output values can be scaled to match either bipolar or sigmoid processing elements in the output layer. Therefore, a novel transfer function must have a response that differs in a nonlinear manner from the sigmoid function (e.g., the Gaussian function). A sigmoid function is given by

o_j = 1/(1 + e^{-net_j})    (1)
for which o_j is the output of the jth neuron and net_j is the linear activation of the neuron. Net is obtained by

net_j = Σ_{i=1}^{v} w_{ij} o_i    (2)
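A minimal numerical sketch of eqs 1 and 2 in Python (an illustration, not part of the original letter; NumPy is assumed, and the weights and input activations are arbitrary values chosen only for demonstration):

import numpy as np

def net(weights, inputs):
    # eq 2: net_j = sum over i of w_ij * o_i
    return np.dot(weights, inputs)

def sigmoid(net_j):
    # eq 1: o_j = 1 / (1 + exp(-net_j))
    return 1.0 / (1.0 + np.exp(-net_j))

o_prev = np.array([0.2, 0.0, 0.9])   # hypothetical outputs of the preceding layer
w_j = np.array([0.5, -1.0, 0.3])     # hypothetical weight vector of unit j
print(sigmoid(net(w_j, o_prev)))     # output of unit j, a value in (0, 1)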
for which v is the number of input connections, w_{ij} is a component of the weight vector, and o_i is the input activation of the ith neuron in the preceding layer. Bipolar outputs can be obtained by applying a general function

*o_j = n(o_j - 1/2)    (3)

The scale factor (n) determines the range of the bipolar output (*o_j), which spans from -n/2 to n/2. A typical value for n is 2, which yields a bipolar function whose outputs range between -1 and +1. The range of output is twice that of a sigmoid function. The scale factor (n) will be referred to as the extent of the transfer function. For an extent of unity, the range is identical to that of the conventional sigmoid. The hyperbolic tangent is related to a bipolar sigmoid as shown

tanh(net_j) = 2/(1 + e^{-2net_j}) - 1    (4)

for which tanh is the hyperbolic tangent.
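The relation in eq 4 can be checked numerically; the short Python sketch below (an added illustration, assuming NumPy) confirms that the extent-2 bipolar sigmoid evaluated at a doubled net value reproduces the hyperbolic tangent.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bipolar(x, n=2.0):
    # eq 3 applied to a sigmoid unit: *o = n * (o - 1/2)
    return n * (sigmoid(x) - 0.5)

net_vals = np.linspace(-4.0, 4.0, 9)
# eq 4: tanh(net) equals the extent-2 bipolar sigmoid evaluated at 2*net
print(np.allclose(np.tanh(net_vals), bipolar(2.0 * net_vals, n=2.0)))   # True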
(1) Li, Z.; Cheng, Z.; Xu, L.; Li, T. Anal. Chem. 1993, 65, 393-396.
(2) Bos, A.; Bos, M.; van der Linden, W. E. Anal. Chim. Acta 1992, 256, 133-144.
(3) Stornetta, W. S.; Huberman, B. A. An Improved Three-Layer Backpropagation Algorithm. In Proceedings of the IEEE First International Conference on Neural Networks; Caudill, M., Butler, C., Eds.; SOS Printing: San Diego, CA, 1987.
(4) Rumelhart, D. E.; Hinton, G. E.; Williams, R. J. Learning Internal Representations by Error Propagation. In Parallel Distributed Processing; Rumelhart, D. E., McClelland, J. L., Eds.; M.I.T. Press: Cambridge, MA, 1986; Vol. 1, Chapter 8.
Although the hyperbolic tangent differs nonlinearly from a bipolar sigmoid, these functions are equivalent in backpropagation networks. The linear activation of a bipolar sigmoid unit could double the magnitude of its net value, so that its nonlinear response is the same as that of the hyperbolic tangent.

Enhancement of training rate by bipolar transfer functions is caused by two factors. The first factor is the increased range of output and increased magnitude of the derivative for an extent that is greater than unity. When the error is propagated backwards through the network during training, the error is scaled by the first derivative of the transfer function. The derivative of the sigmoid function is

f_j'(net_j) = o_j(1 - o_j)    (5)

For bipolar functions this derivative increases by a factor of n, which increases the scale of the error propagating backwards during training. Weight adjustment is obtained by

w_{ij}(t+1) = w_{ij}(t) + η f_j' e_j o_i    (6)

for which w_{ij}(t+1) and w_{ij}(t) are the ith weight component of the jth unit's weight vector at training cycles t + 1 and t. The backpropagated error is e_j. The derivative of the transfer function is f_j', and o_i is the output from the ith unit in the previous layer or the ith variable of the input vector if j corresponds to the first hidden layer. The training rate coefficient is η. Three terms in eq 6 are proportional to the extent: o_i when i is a bipolar unit, f_j', and e_j. For an extent of 2, a 4-fold increase in training rate is observed in the first hidden layer and an 8-fold increase is observed in subsequent layers.

The second factor for bipolar enhanced training rates pertains to preprocessing the input data. For binary data that are sparse (i.e., contain many null values), the training rate is less than for bipolar encoded data. From eq 6, one can see that no weight adjustments occur for null input values (i.e., o_i = 0.0). Bipolar encoded input data can be used with conventional sigmoid networks, or sparse binary data can be inverted, to accelerate training. Null outputs from units in the network are less likely to occur from conventional sigmoid units than from bipolar units, because the null value is an extremum of the conventional sigmoid function. As a result, some networks composed of bipolar units may train slower than conventional backpropagation networks.

A rigorous mathematical proof is beyond the scope of this letter and the author's ability. The following argument may suffice as an explanation. Backpropagation neural networks model relations between inputs and outputs by adjusting weights during training so that prediction error is minimized. If a network has a linear transfer function, then a network of multiple layers can be represented as a network of a single layer whose weight matrix is the product of the weight matrices of the individual layers. Nonlinear transfer functions between layers allow multiple layers to furnish new modeling capabilities. If different nonlinear transfer functions are used that are affine transformations of each other (e.g., sigmoid and bipolar), then the weights may be adjusted to obtain the same minimum error at each set of connections between layers. Evaluating backpropagation neural networks is complicated, because they are prone to local minima. It is important to vary the random initial weights before training and to report results statistically, as with any analytical measurement.
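The scaling of the weight increment by the extent can be made concrete with a small Python sketch (an added illustration, assuming NumPy; the training rate, error, input, and net values are hypothetical):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_prime(x):
    o = sigmoid(x)
    return o * (1.0 - o)            # eq 5: f_j' = o_j * (1 - o_j)

def bipolar_prime(x, n=2.0):
    return n * sigmoid_prime(x)     # derivative of *o_j = n * (o_j - 1/2)

eta, e_j, o_i, net_j = 0.001, 0.05, 0.4, 0.7           # hypothetical values
dw_conventional = eta * sigmoid_prime(net_j) * e_j * o_i   # eq 6 increment
dw_bipolar = eta * bipolar_prime(net_j) * e_j * o_i
print(dw_bipolar / dw_conventional)                        # 2.0, i.e., the extent n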
Table I. Network Parameters

parameter               A              B        C
training rate (η)       0.004/0.008^a  0.001    0.001
momentum                0.6            0.6      0.6
decay                   0.0001         0.0001   0.0001
sigmoid prime offset    0.01           0.01     0.01

^a First layer/second layer.

[Figure 1. Training relative error (logarithmic scale) as a function of training cycle for the networks described in Table I.]
Figure 1 gives the training relative error as a function of training cycle. These examples used a two-layer network for which the hidden layer is comprised of 20 units and the output layer is comprised of two units. The network parameters are given in Table I. The network with the corrected sigmoid function trains equivalently to the network composed of bipolar transfer functions. This result should be independent of training set. Novel transfer functions should furnish new modeling capabilities for backpropagation neural networks. Faster training rates reported for networks with bipolar transfer functions may be caused by their larger extents. The same result may be obtained with conventional sigmoid networks with corrected training rate coefficients. In some cases the use of bipolar transfer functions may reduce the network training rate. Significant variations exist in training rates due to initial random weight selections. When neural networks are being evaluated, some measure of precision should be included.
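The correction of the training rate coefficients follows directly from the argument after eq 6; the Python sketch below is an added illustration, and it assumes that network A in Table I is the conventional sigmoid network whose coefficients were scaled to match the extent-2 bipolar networks B and C trained at 0.001.

extent = 2.0
eta_bipolar = 0.001                          # rate of the bipolar networks (assumed B and C)
eta_first_hidden = eta_bipolar * extent**2   # f_j' and e_j scale with the extent -> 0.004
eta_second_layer = eta_bipolar * extent**3   # o_i, f_j', and e_j all scale -> 0.008
print(eta_first_hidden, eta_second_layer)    # 0.004 0.008, cf. Table I, column A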
ACKNOWLEDGMENT I thank the reviewers for their helpful comments.
RECEIVED for review March 9, 1993. Accepted May 14, 1993.