Opening Up the Black Box of Artificial Neural Networks

Journal of Chemical Education, May 1994

M. T. Spining, The University of Tennessee, Knoxville, TN 37996-1600

J. A. Darsey

University of Arkansas at Little Rock, Little Rock, AR

B. G. Sumpter and D. W. Noid
Chemistry Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831-6182

In the past decade, there has been tremendous growth in studies and applications of neural networks (1-7). A major impulse for this was provided by breakthroughs in learning algorithms, such as Hopfield, Kohonen, and backpropagation, and by technological advances in personal and mainframe computing that enable neural network simulations to be carried out more efficiently. As a computational tool, neural networks are a rapidly emerging technology that can significantly enhance analysis or even provide solutions to a number of very difficult problems. Neural networks offer a broad range of applicability. For example, they have been used in spectral analysis, X-ray characterization, protein structure prediction, polymer science, quantitative structure-activity or structure-property relationships, and estimation of aqueous solubilities of organic compounds. Such versatility suggests that the fundamental applications of neural networks in science, and especially chemistry, have far-reaching ramifications. In this paper we give a general overview of neural networks so that a larger group of chemists can become familiar with this field.

Neural networks are most commonly called artificial neural networks (ANNs) to clarify that they are not a biological system. For most purposes, ANNs are a suite of computer software, that is, simple programs generally written in C, Fortran, Basic, or Pascal. The neural network domain of software is best defined by the programs that help simulate possible (because we have little concrete knowledge of how the brain works) functions of the brain (13). As can be seen from the definition, there are numerous programs that fall into this classification. Therefore, neural networks are commonly divided into certain categories. In this paper, neural networks are divided according to training methods: supervised and unsupervised.
Supervised training is used when a training set consisting of inputs and outputs (examples with known results) is available. The network uses the training set to determine an error and then adjusts itself with respect to that error. Unsupervised networks are used when training sets with known outputs are not available, for example, for real-time learning. These networks use the inputs to adjust themselves so that similar inputs give similar outputs (regions with large correlations). Another classification that will be used is feedforward versus feedback networks. In a feedforward network, information is propagated through the network in one direction until it emerges as the network's output. However, in a feedback (recurrent) network, the input information is propagated through the network but can also cycle back into the network (the signal is recurrent).


Research sponsored by the Division of Materials Sciences, Office of Basic Energy Sciences, U.S. Department of Energy, under contract DE-AC05-84OR21400 with Martin Marietta Energy Systems.

406

Journal of Chemical Education

In the present paper, we give a fundamental overview of feedforward neural networks, present some applications using them in chemical physics, and comment on the potential for future uses in chemistry. We begin by discussing some specific types of neural networks that provide the generality needed to pursue applications in the chemical sciences.

Supervised Networks

A backpropagation network, perhaps the most used neural network, is an example of a feedforward network with supervised training. (Some other examples are Mean Field Annealing, recurrent cascade correlation, recurrent backpropagation, perceptron, cascade correlation, Brain-State-in-a-Box, Fuzzy Cognitive Map, Boltzmann Machine, Learning Vector Quantization, Adaline, Madaline, Cauchy Machine, Adaptive Heuristic Critic, Time Delay Neural Network, Associative Reward Penalty, Avalanche Matched Filter, backpercolation, ARTmap, and Adaptive Logic Network.) As discussed above, a supervised network must have a training set that consists of an input vector paired with a corresponding output vector. The backpropagation network uses this training set to learn any general features (acting as a feature detector) that may exist in the training set. Once adequately trained, the network can make predictions on new data that were not used in the training (called generalization to a test set).

The Backpropagation Network

The backpropagation network is composed of a collection of nodes (also called neurons, processing elements, neurodes, etc.). The node was developed from speculation about actual brain neuron activity; most introductions to neural networks (1-5) go into some of the specifics of the functioning of brain neurons, but that will not be discussed here. The node (Fig. 1) sums the product of each connection weight (w_jk) from a node j to a node k and an input (z_j) to get the value SUM for node k:

SUM_k = Σ_j w_jk z_j   (1)

This sum is simply the dot product of the input and weight vectors.

It can be conveniently represented in matrix notation as SUM^(M) = W^(M) z^(M-1), where M labels the layer and z^(M-1) collects the outputs of the previous layer. In the vector notation, an additional dot product is used to include the bias value γ. The output of a bias node is always 1.0, and the bias weights γ are treated in the same fashion as the w_jk's. This additional set of weights gives the network added degrees of flexibility, which enables it to

Figure 3. This is a two-layer (input layer is not counted) backpropagation network with input vector x, l input nodes, m hidden nodes, n output nodes, and output vector OUT.

Figure 1. (a) is the composition of the neuron that will be represented by (b).

solve more difficult problems. The value SUM is then applied to a transfer function, which outputs a value OUT.

Transfer Function

Any continuous and differentiable function may be used as a transfer function. However, the logistic function (also known as the Fermi function, where β plays the role of an inverse temperature),

OUT = 1 / (1 + e^(-β·SUM))

is most prevalent (see Fig. 2). This is because it has a simple and continuous first derivative.


The transfer function (also known as an activation or squashing function) forces the output to be within a specific range, usually zero to one for the logistic function, indicating an active or inactive node. The "pseudo-temperature" β effectively changes the steepness of the logistic function: as β grows from 0 toward infinity, the function goes from approximately linear to a step function.
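As a concrete illustration, the node computation of eq 1 and the logistic transfer function can be sketched in a few lines. (Python is used here for readability rather than the C or Fortran of the era; the function and variable names below are our own, not from any original program.)

```python
import math

def node_output(weights, inputs, bias_weight, beta=1.0):
    """One node: SUM_k = sum_j w_jk * z_j plus the bias term (eq 1),
    squashed by the logistic transfer function with steepness beta."""
    # The bias node's output is fixed at 1.0, so its contribution
    # is simply the bias weight itself.
    total = sum(w * z for w, z in zip(weights, inputs)) + bias_weight * 1.0
    # Logistic ("Fermi") function: near 0 for large negative SUM,
    # near 1 for large positive SUM.
    return 1.0 / (1.0 + math.exp(-beta * total))

def logistic_derivative(out, beta=1.0):
    """The simple first derivative of the logistic: f'(SUM) = beta * OUT * (1 - OUT)."""
    return beta * out * (1.0 - out)
```

With zero net SUM the node outputs 0.5, and a large β pushes the response toward a step function, as described above.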


Neural Network Architecture

An example of a neural network architecture is shown in Figure 3. The network is divided into three layers: input, hidden, and output. (This is most commonly called a two-layer network, since the input layer does no computation.) It has only one input layer and one output layer; however, any number of hidden layers may be used. Unfortunately, there are no rigorous rules to determine the number of hidden layers, but in general no more than two hidden layers are needed to model any mathematical function. More layers of hidden nodes may help solve the problem more efficiently, but two layers is all that should ever be needed. The nodes in the input layer are simple distributive nodes, which do not alter the input value at all. The output and hidden layers are made up of the nodes described above. The numbers of nodes in the input and output layers are defined by the problem being studied. However, there is no easy way to know how many nodes to incorporate into the hidden layers. If too many nodes are used, the network will not give good generalization but will most likely memorize the training set (over-parameterization). On the other hand, if too few nodes are used, the network will not learn the training set (under-parameterization).

Gradient Descent Training

Figure 2. The logistic function, where OUT (y) is 1 for large positive values of SUM (x) and zero for large negative values of SUM (x).

The backpropagation network uses the training set to adjust its connection weights so that the network's output error tends toward a minimum. This process of training is approximately equivalent to a gradient descent calculation. To better understand this method, picture the error mapped onto a three-dimensional surface, where valleys are low error and mountains are high error. The network determines (from where it is on the surface) which direction leads to a region with less error. Then it takes a step (proportional to the size of the learning rate) in that direction.

Volume 71 Number 5 May 1994

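The descent can be made concrete with a small worked example. The sketch below trains a single logistic output node by the error-correction rule of eqs 5 and 7 in the text (with no hidden layer, no error needs to be propagated back). This is a minimal Python sketch under our own naming; the tiny OR-gate training set is invented for illustration.

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

def train_node(patterns, eta=0.5, epochs=2000):
    """Online training of one logistic node.  Per pattern:
    delta = f'(SUM) * (target - OUT)   (eq 5 of the text)
    dw    = eta * delta * input        (eq 7 of the text)"""
    w = [0.0, 0.0]   # connection weights
    b = 0.0          # bias weight (the bias node always outputs 1.0)
    for _ in range(epochs):
        for z, target in patterns:
            out = sigmoid(w[0] * z[0] + w[1] * z[1] + b)
            delta = out * (1.0 - out) * (target - out)
            # Each step is proportional to the learning rate eta;
            # too large a step can overshoot a minimum, too small is slow.
            w = [wi + eta * delta * zi for wi, zi in zip(w, z)]
            b += eta * delta * 1.0
    return w, b

# Invented training set: the logical OR of two binary inputs
patterns = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
```

After training, the node's output crosses 0.5 on the correct side for each pattern, which is the error tending toward a minimum described above.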

Although this is a very effective method, it has many problems. If the learning rate is too large, then minima can be overstepped. However, if it is too small, the learning process can take a very long time. There is also the problem of finding the global or absolute minimum: once the network finds a minimum, it may not be able to get out of it, even if it is not the global minimum. Both of these problems have been explored, and options exist to help overcome them by adding features to the learning process (14). Despite these limitations, the gradient descent method is an effective tool for decreasing the network's error, and it defines the major part of the backpropagation learning rule.

Computing the Error

The backpropagation network "learns" by adjusting its weights according to the error, which is generally found by subtracting the value the network determines as an output from the target value (the desired output) in the training set. It is relatively straightforward to see how the error is computed for the output layer. Determining the error for the hidden-layer nodes is less clear but nevertheless still quite straightforward. First, we give the error signal for any node i in the output layer a definition of

δ_i = f′(SUM_i) (target_i − OUT_i)   (5)

where f′(SUM_i) is the derivative of the transfer function evaluated at the SUM of node i. One approach is to do what is called batch processing, where the δ's for each node i are summed over all patterns p from the training set and divided by the number of patterns P to get the δ for node i:

δ_i = (1/P) Σ_p δ_i^(p)   (6)

Then δ_i is multiplied by the learning rate η (usually between 0.01 and 1) and by OUT_m, the output of node m in the previous layer, to get the desired change for the connection weight from node m to node i, which should incrementally decrease the error when added to the previous connection weight:

Δw_mi = η δ_i OUT_m   (7)

This produces a new connection weight that has taken a step of size η toward less error. The batch processing of the input data defines a learning rule that is consistent with a true gradient descent. A second and popular technique is to process each pattern individually: calculate the δ's for an input, backpropagate the error, and then process another input. This is an approximate gradient descent as long as the learning rate η is kept very small. (See refs 1-3 for more details.)

The error for a node in a hidden layer is not as easily found because there is no target value to compare with the output. However, by propagating back the error signals of the next layer, we can get an error signal for that node. In other words, we get δ for a node i in the last hidden layer by multiplying two quantities:

δ_i = f′(SUM_i) Σ_j w_ij δ_j   (8)

The first quantity is the sum of the products of the weights w_ij from node i to each node j in the output layer and the δ for that node j. The second quantity is the derivative of the transfer function using the SUM of node i. With this δ, use eq 7 to get the change; then, by adding that to the old weight, a newly adjusted weight is found. The δ's may be propagated back through the entire network in the same fashion to make appropriate adjustments to the connection weights in all hidden layers. See refs 3-5 for a more detailed derivation of the backpropagation algorithm.

Unsupervised Networks

There are a number of neural networks that fall into the category of unsupervised networks, for example, Adaptive Resonance Theory (ART1, ART2, ART3, Fuzzy ART), Adaptive Grossberg, Shunting Grossberg, Hopfield nets (discrete, continuous), Bidirectional Associative Memory (BAM, ABAM), Kohonen self-organizing maps, Kohonen topology-preserving maps, Temporal Associative Memory, Learning Matrix, Driver-Reinforcement Learning, Linear Associative Memory, Optimal Linear Associative Memory, Sparse Distributed Associative Memory, Fuzzy Associative Memory, and Counterpropagation.

Kohonen Network

A representative example that is relatively simple is the algorithm used by Kohonen to develop topology-preserving feature maps. A Kohonen network is also an example of a feedforward network with unsupervised training. The network does not require a training set (input and output pairs) because it adjusts itself to determine correlations within the input data. After training terminates, similar inputs will produce similar responses from the output nodes.

A Kohonen network consists of an input layer and an output layer connected by weights. An example of a network is shown in Figure 4. The input layer is the same size (same number of nodes) as the input vectors. The output layer comprises as many nodes as you decide. Each output node has a weight vector associated with it, with the same dimension as the input vector. The output is given by the dot product of the output node's weight vector and the input vector. However, the output nodes are not used during training, only in interpreting the results.


The Training Process

Before the network can train itself, certain procedures must be carried out on the inputs and weights. Each input

Figure 4. This is a Kohonen network with input vector x, l input nodes, m output nodes, output vector OUT, and the weight vectors.

vector must be normalized so that the Euclidean distance formula can be used. Normalized vectors all have the same length; only their directions differ. The following equation is usually used to normalize input:

t_i = t′_i / (Σ_j t′_j²)^(1/2)   (9)

where T is the input vector, t is a vector component, n is the dimension of the vector, and the prime denotes a value before the correction. As in the backpropagation network, the weights of a Kohonen network are initially randomized. (In ref 2, the author suggests values between 0.4 and 0.6.) However, before training begins, the weights must be normalized (eq 9). The training process is simply a matter of moving the weight vectors closer (in space) to the similar input vectors. An input vector is chosen at random, and then the weight vector closest to it is moved to be even closer to the input vector. The best way to visualize this is to consider a sphere in which all the weight vectors protrude from the center to some point on the surface. The input vector is then mapped, protruding from the center, in the sphere, and the weight vector closest to the input vector is moved toward the input vector. The distance d between the weight vector w and the input vector i is computed by

d = [Σ_j (i_j − w_j)²]^(1/2)   (10)
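The normalization of eq 9 is a one-liner in practice; the sketch below (our own naming, not from any original program) scales a vector to unit Euclidean length while preserving its direction.

```python
import math

def normalize(t_prime):
    """Eq 9: t_i = t'_i / sqrt(sum_j t'_j ** 2), giving a unit-length
    vector whose direction is unchanged.
    (A zero vector cannot be normalized.)"""
    length = math.sqrt(sum(c * c for c in t_prime))
    return [c / length for c in t_prime]
```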

The weight vector with the smallest distance is determined to be the winning weight vector. The difference between the winning weight vector and the input vector is multiplied by the learning coefficient η to get the adjustment for the existing weight vector. By adding the adjustment to the existing weight vector, a new vector is obtained that is closer to the input vector by a factor of η.
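One training step as just described can be sketched as follows (names are ours): compute the distance of eq 10 to every weight vector, pick the winner, and move it a fraction η of the way toward the input.

```python
def kohonen_step(weights, x, eta):
    """Move the weight vector nearest to input x (by the distance of
    eq 10) a step of size eta toward x; returns the winning index."""
    def dist2(w):
        # Squared Euclidean distance; the square root in eq 10 is not
        # needed merely to compare distances.
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    winner = min(range(len(weights)), key=lambda k: dist2(weights[k]))
    w = weights[winner]
    # w_new = w + eta * (x - w): closer to the input by a factor of eta
    weights[winner] = [wi + eta * (xi - wi) for wi, xi in zip(w, x)]
    return winner
```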

The Neighborhood The Kohonen network implements the idea of the neighborhood of the winning weight vector. All of the weight vectors within a given radius of the winning weight vector are

Figure 5. The Mexican Hat function is one choice of a function to define the learning rate (η) of the neighborhood as the distance (d) from the winning weight vector increases.

considered to be the neighborhood. The adjustment of each weight vector in the neighborhood is scaled with respect to its distance from the winning weight vector. Different programs use different functions to determine the magnitude of the adjustment. The most effective function is the Mexican Hat function (Fig. 5) because it decreases the adjustment as the weight vector gets further from the winning weight vector. In fact, the function becomes negative beyond a certain radius, which emphasizes the boundary of the group. The learning rate (η) for the neighborhood is determined by the function used.

During training, the Kohonen network makes adjustments to the learning rate and the neighborhood size over time. The time is incremented after one pattern has been presented and all weights have been adjusted. Kohonen networks usually decrease the learning rate and the neighborhood size after a certain time has elapsed.
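The text does not give an explicit formula for the Mexican Hat profile; one common choice with the stated properties (largest at the winner, negative beyond a certain radius, decaying toward zero at large distances) is the Ricker wavelet sketched below, where sigma is an assumed width parameter.

```python
import math

def mexican_hat(d, sigma=1.0):
    """Ricker ("Mexican Hat") profile for the neighborhood adjustment:
    positive near the winning weight vector, negative past d = sigma,
    and decaying toward zero at large distances."""
    r2 = (d / sigma) ** 2
    return (1.0 - r2) * math.exp(-r2 / 2.0)
```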

The work on applying neural networks to chemical systems has only recently been started. The first symposium on using neural networks in chemistry was held at the National ACS Meeting in 1991. The first review to be published in the chemistry literature was written by Zupan and Gasteiger (6). They reviewed the application of neural networks to spectroscopy (mass, IR, UV, and NMR), protein structure, and structure-activity relationships. More recently, Maggiora et al. (7) have addressed a neural network approach to solving chemical quantitative structure-activity relationships (QSAR) or structure-property relationships (QSPR). These papers (6, 7) present good literature reviews of the uses of neural networks in chemistry; the reader is referred to them for a more extensive discussion. Below we review some of the work not given in those references.

Computational Chemistry

The work in our laboratory has focused on other areas of research, namely coupling of neural networks with computational chemistrv (9-13) and usine neural networks as a tool to aid in the calculation of heat capacities from experimental data (1615).


Relationships among Parameters

Neural networks represent a valuable computational method when coupled with various other methods, such as quantum mechanics (8), molecular dynamics (9-11), normal coordinate analysis (12), and Monte Carlo methods (13). A common problem for each of these methods is the relationship between the parameters in the potential energy function and other parameters, such as the number of atoms, temperature, etc. We have found that a feedforward network using backpropagation is useful in learning these relationships and can make useful predictions about the outcome of calculations not yet considered. (Most of these calculations are very CPU-intensive because they are related to polymer properties, that is, they involve a large number of atoms.)

Numerical Methods and Simulations

Another possibility we have considered is using the neural network to correct systematic errors in the molecular dynamics method (16). This concept can be easily generalized to correcting systematic error in a variety of numerical methods. Finally, we have recently found neural networks to be useful in sorting through results from a simulation and identifying various groups in the data. In preliminary work using an ART network (17) (a recurrent network with a Kohonen layer), we discovered that clusters were developed that could be studied as a group. A major problem in polymer dynamics simulation is that a very large number of atom positions are generated for many time steps. The results are too voluminous to plot and, to compute various kinetic events, one must have an idea of what to look for and where. We believe our results (18) show that the ART network will be helpful in this area. We have also explored the use of neural networks for processing experimental data. In this case, we found that heat capaci