
Relationship between Mutual Information and Classification

G. L. Ritter, S. R. Lowry, H. B. Woodruff, and T. L. Isenhour*

Department of Chemistry, University of North Carolina, Chapel Hill, N.C. 27514

Mutual information is an information theory term that describes the amount of information to distinguish among members of different classes. This concept is demonstrated by describing the classification ability of a maximum likelihood classifier. In an artificially generated problem, the recognition ability and the square root of mutual information are seen to be linearly related. The linear correlation drops to 0.83 for a series of infrared questions where the statistical independence assumption is not strictly valid.

In most chemical problems, the goal is to find some transformation of measured experimental properties that reveals another, more obscure chemical or physical property. Recently there has been interest in applying the principles of information theory to such chemical systems (cf. 1-3). Modern information theory is largely a result of work done by Claude Shannon in the 1940's (4). Shannon was investigating communication systems where a message is transferred from a transmitter to a receiver. In this work, he assumed that the source of the message was stochastic, and he wished to exactly or approximately reproduce the message at the receiver. The solution to the problem required a definition of an uncertainty or "entropy" in the message at the receiver. In chemical systems, the format of the information theory problem is usually similar. First, an experimenter measures an event, X, which may take on any of several states, x_i. The outcome of X must fall into one of the states; call it x_k. The fraction of the time that x_k occurs denotes a probability for the state, p_k. Given this stochastic form for X, information theory defines information or entropy as a measure of the uncertainty in predicting the state of X. Shannon suggested the following as a measure of the uncertainty for a system of n possible states:

H(X) = -\sum_{k=1}^{n} p_k \log p_k    (1)

Some other event, Y, may occur in conjunction with X. A complete description of this system requires not only a probability scheme for the states of X, but also a probability scheme for Y and a set of conditional probabilities. The conditional probability, denoted P_k|j, indicates the probability of state x_k arising when Y is in state y_j. From the complete set of probabilities and conditional probabilities, the mutual information may be found.

The mutual information is the amount of information in the event X to determine the state of Y (5). These definitions suggest the application of information theory to chemical classification problems. In the classification problem, we wish to distinguish or discriminate an obscure chemical or physical property by use of a mathematical transformation of the measured experimental quantities. This property classifier is normally designed to minimize the number of incorrect categorizations in a collection of known data. If the classifier is to do well, then there must be information in the experimental data to determine the unknown property. To test the relationship between classification or discrimination and mutual information, we wish to choose a classifier that is stochastic and attempts to minimize the number of errors. This classifier is a maximum likelihood estimator. The spectroscopic data that we use to test the relationship are a series of substructure questions using infrared spectra. The spectra are coded in an intensity-eliminated, or peak/no peak, form. From the experimentally measured binary infrared spectra, we attempt to estimate the amount of information to determine the various chemical substructures. In the classification decision, the category or substructure that is most likely to give the infrared spectrum of the unknown is selected. If the mutual information content is high, then the recognition ability should also be high.

DATA

The data for this study have been taken from the American Society for Testing and Materials file of 91 875 binary infrared spectra. Thirteen mutually exclusive classes were chosen as the criterion for selecting spectra (Table I) (6). Of the spectra belonging exclusively to each of the classes, 200 were randomly chosen and placed into the data set. The resulting 2600 compounds each contain fewer than sixteen carbons and contain only hydrogen, oxygen, and nitrogen otherwise. Each compound is represented by a peak/no peak spectrum in the range 2.0 to 15.9 μm. This range is divided into 139 equal-sized intervals, each describing whether a peak maximum occurs in the corresponding interval. For convenience, therefore, each spectrum is given as a 139-dimensional binary vector x such that

x = (x_1, x_2, x_3, \ldots, x_i, \ldots, x_{139})    (2)

x_i = 1, if a peak maximum occurs in the 0.1-μm interval preceding (2.0 + 0.1 i) μm
    = 0, otherwise
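As a concrete illustration of this encoding (not part of the original paper), the following sketch builds such a binary vector from a list of peak-maximum wavelengths; the function name and the example peak list are hypothetical.

```python
# Minimal sketch of the peak/no peak encoding described above.
# Interval i (1-based) covers the 0.1-um range preceding (2.0 + 0.1*i) um,
# i.e. wavelengths in [2.0 + 0.1*(i-1), 2.0 + 0.1*i).
def encode_spectrum(peaks_um, d=139, start=2.0, width=0.1):
    x = [0] * d
    for wavelength in peaks_um:
        i = int((wavelength - start) / width)  # 0-based interval index
        if 0 <= i < d:                         # (boundary values would need
            x[i] = 1                           #  more care in real use)
    return x

# Hypothetical peak positions, for illustration only.
print(sum(encode_spectrum([3.4, 5.85, 13.07])))  # 3 intervals flagged
```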

Table I. Chemical Classes Used in the Study

Class No.   Chemical functionality
    1       Carboxylic acid (RCOOH)
    2       Ester (RCOOR')
    3       Ketone (RCOR')
    4       Alcohol (ROH)
    5       Aldehyde (RCHO)
    6       1° Amine (RNH2)
    7       2° Amine (R2NH)
    8       3° Amine (R3N)
    9       Amide
   10       Ether and acetal (ROR', RCH(OR')2)
   11
   12       Nitro and nitroso (RNO2, RN=O)
   13       Nitrile and isonitrile (RC≡N, RN≡C)

Table II. Example Probabilities for One Spectral Interval

          P(x1 = 1 | c_j)    P(x1 = 0 | c_j)
  j = 1        1/4                3/4
  j = 2        1/2                1/2

Table III. Example Probabilities for Two Spectral Intervals

                      P(x_i = δ_i | c_j)
                     j = 1     j = 2     P(x_i = δ_i)
  i = 1   δ_i = 1     1/4       1/2         3/8
          δ_i = 0     3/4       1/2         5/8
  i = 2   δ_i = 1     1/3       3/8        17/48
          δ_i = 0     2/3       5/8        31/48

Table IV. Joint Probabilities if Statistical Independence is Valid

                              P(δ1, δ2 | c_j)
          δ1 = 1, δ2 = 1   δ1 = 1, δ2 = 0   δ1 = 0, δ2 = 1   δ1 = 0, δ2 = 0
  j = 1        1/12             1/6              1/4*             1/2*
  j = 2        3/16*            5/16*            3/16             5/16

This type of problem is motivated by the increasing size of many data files. In order to store the file easily, an intensity-eliminated format may be used. Thus, there is special interest in solving problems with binary data.

THEORY

The classification methods used in this paper have been described extensively elsewhere and therefore will be reviewed only briefly.

Recognition (7, 8). The classification algorithm assumes that all of the descriptive parameters, the infrared peak positions, are statistically independent. The classification scheme is illustrated in the following example. Suppose that an investigator decides that a classification decision may be made on the basis of one spectral interval (denoted by x1). From the information in this interval, the compound is to be classified into one of two classes, c1 or c2. The statistical maximum likelihood classifier requires knowledge of four conditional probabilities: 1) the probability of a peak maximum in x1 for members of c1, 2) the probability of no peak maximum in x1 for c1, 3) the probability of a peak maximum in x1 for c2, and 4) the probability of no peak maximum in x1 for c2. These four probabilities are designated P(x1 = 1 | c1), P(x1 = 0 | c1), P(x1 = 1 | c2), and P(x1 = 0 | c2).

Since a peak maximum must either occur or not occur in a given interval, only two of these probabilities are independent; i.e., since P(x1 = 1 | c_j) + P(x1 = 0 | c_j) = 1, only P(x1 = 1 | c_j) is needed to describe c_j. An example of these probabilities is shown in Table II. The value for P(x1 = 1 | c1) implies that 1/4 of all class 1 spectra have a peak maximum in x1. Similarly, 1/2 of all class 2 spectra have x1 = 1. The maximum likelihood decision then chooses the class which is most likely to have that response. Thus, if an unknown is found to have a peak maximum in x1 (i.e., x1 = 1), the decision will be class 2. Class 2 is chosen since 1/2 of c2 have x1 = 1 and only 1/4 of c1 have x1 = 1. At this point, the investigator may decide that a more accurate decision may be made if a second spectral interval, x2, is added. The probabilities for x2 are included in Table III. Then, if the unknown is found to have no peak maximum in x2, which classification decision should be made? Or, which is larger, P(1,0 | c1) or P(1,0 | c2)? If the intervals are assumed to be statistically independent, then P(1,0 | c_j) will be the product of P(x1 = 1 | c_j) and P(x2 = 0 | c_j):

P(1,0 | c_1) = P(x_1 = 1 | c_1) \cdot P(x_2 = 0 | c_1) = (1/4)(2/3) = 1/6

P(1,0 | c_2) = P(x_1 = 1 | c_2) \cdot P(x_2 = 0 | c_2) = (1/2)(5/8) = 5/16

Since 5/16 of all c2 spectra have a (1,0) spectrum and 1/6 of c1 spectra have a (1,0) spectrum, the decision is still class 2. Table IV presents the joint probabilities for each possible spectrum in both classes. The starred values indicate the maximum likelihood estimates. This concept is extended to three, four, or more spectral intervals in the obvious way. In turn, each probability is multiplied and the overall products are compared to make the classification decision:

P(x | c_j) = \prod_{i=1}^{d} P(x_i = \delta_i | c_j)    (3)
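The decision rule above can be made concrete with a short sketch (not from the original paper); the probabilities are those of Tables II and III, and the function names are hypothetical.

```python
from math import prod

# P(x_i = 1 | c_j) for the two example intervals (Tables II and III).
p_peak = {1: [1/4, 1/3], 2: [1/2, 3/8]}

def joint_probability(spectrum, c):
    # Independence assumption: multiply the per-interval conditional
    # probabilities; these products are the entries of Table IV.
    return prod(p if s == 1 else 1 - p for s, p in zip(spectrum, p_peak[c]))

def classify(spectrum):
    # Maximum likelihood decision between the two classes.
    return max(p_peak, key=lambda c: joint_probability(spectrum, c))

print(joint_probability((1, 0), 1))  # 1/6, as in the text
print(joint_probability((1, 0), 2))  # 5/16
print(classify((1, 0)))              # 2
```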

Equation 3 may be written in a linear form when the following steps are taken. First the notation is changed so that p_i|j = P(x_i = 1 | c_j), and then Equation 3 is rewritten

P(x | c_j) = \prod_{i=1}^{d} (p_{i|j})^{\delta_i} (1 - p_{i|j})^{1 - \delta_i}    (4)

This is equivalent to Equation 3 since, if δ_i = 1, only p_i|j is included in the product and, if δ_i = 0, only (1 - p_i|j) is included in the product. If the logarithm of both sides is taken, then the class is chosen such that the following term is maximized:

\log P(x | c_j) = \sum_{i=1}^{d} [\delta_i \log p_{i|j} + (1 - \delta_i) \log(1 - p_{i|j})]    (5)
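The log form also matters numerically: with d = 139 intervals the product in Equation 4 becomes very small, while the sum in Equation 5 does not. A brief sketch follows (an illustration under the worked-example probabilities, not the authors' code; it assumes probabilities strictly between 0 and 1).

```python
from math import log

def log_score(spectrum, p_class):
    # Equation 5: sum of delta_i*log(p_i|j) + (1 - delta_i)*log(1 - p_i|j).
    # The class with the larger score is the maximum likelihood choice.
    return sum(d * log(p) + (1 - d) * log(1 - p)
               for d, p in zip(spectrum, p_class))

# Two-interval example: class 2 again wins for the spectrum (1, 0).
print(log_score((1, 0), [1/4, 1/3]))  # log(1/6)  ~ -1.79
print(log_score((1, 0), [1/2, 3/8]))  # log(5/16) ~ -1.16
```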

Information Theory and Mutual Information (9). The mutual information is the average amount of information to distinguish members of class 1 from members of class 2. This is best explained and calculated as the difference between the total average entropy of the spectra and the average conditional entropy of the spectra. The total average entropy, H(x), is the average amount of information to distinguish among the spectra. In other words, H(x) is the average amount of information to distinguish 1) among members of class 1, 2) among members of class 2, and 3) among members of class 1 and class 2. If no spectrum has a peak maximum in some interval k, then p_k (the probability of a peak occurring in interval k regardless of class) is zero. Then, in this spectral interval, there is no information to distinguish a spectrum from any of the others, so H(x_k) should be zero. However, if, in the spectral interval, a maximum does occur in some of the spectra, then there should be some information present to discriminate among spectra and H(x_k) should not be zero. The value of H(x_i) when two responses are possible is given by

H(x_i) = -p_i \log p_i - (1 - p_i) \log(1 - p_i)    (6)

Normally, the logarithms are taken to the base 2 and the units of the result are "bits". According to this formula, the maximum information occurs when dH/dp_1 = 0; this occurs at p_1 = 1/2. The total average entropy when p_1 = 1/2 is

H(x_1) = -(1/2) \log(1/2) - (1/2) \log(1/2) = 1 bit    (6a)
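A two-line sketch of Equation 6 (illustrative only; the function name is hypothetical) confirms the 1-bit maximum at p = 1/2.

```python
from math import log2

def binary_entropy(p):
    # Equation 6, in bits; a fixed response (p = 0 or 1) carries no information.
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

print(binary_entropy(0.5))   # 1.0 bit, the maximum (Equation 6a)
print(binary_entropy(0.0))   # 0.0 bits
```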

If the measurements are statistically independent, the total average entropy for x is the sum of the entropies for each parameter:

H(x) = \sum_{i=1}^{d} H(x_i) = -\sum_{i=1}^{d} [p_i \log p_i + (1 - p_i) \log(1 - p_i)]    (7)

In the example given in Table III, the last column indicates p_i if the classes are assumed to be the same size. The total average entropy is calculated below.

H(x) = -(3/8) \log(3/8) - (5/8) \log(5/8) - (17/48) \log(17/48) - (31/48) \log(31/48) = 1.89 bits    (8)

Therefore, there is an average of 1.89 bits to distinguish the spectra. The average conditional entropy, H(x | c), measures the average information to distinguish among members of the same class. The conditional entropy for c1 is calculated as shown below, where the conditional probabilities are used in the calculation:

H(x | c_1) = -\sum_{i=1}^{d} [p_{i|1} \log p_{i|1} + (1 - p_{i|1}) \log(1 - p_{i|1})]    (9)

The conditional entropies are summed for both classes to give H(x | c). Each term in the sum is weighted by the relative class size. If P(c_j) is the probability that class j occurs, then

H(x | c) = \sum_j P(c_j) H(x | c_j)    (10)

The class probabilities must be consistent with the maximum likelihood classifier that is used earlier. In this classifier, each class is assumed to be equally probable, or P(c_1) = P(c_2) = 1/2. Using this, H(x | c) is calculated below.

H(x | c) = -(1/2)[(1/4) \log(1/4) + (3/4) \log(3/4) + (1/3) \log(1/3) + (2/3) \log(2/3)]
           -(1/2)[(1/2) \log(1/2) + (1/2) \log(1/2) + (3/8) \log(3/8) + (5/8) \log(5/8)] = 1.84 bits    (11)

Thus, there are 1.84 bits of information to distinguish among members of the same class. The difference between the total average entropy and the average conditional entropy is the average information to distinguish among members of class 1 and class 2. This is the mutual information (M.I.):

M.I. = H(x) - H(x | c)    (12)

In the example, the mutual information is

M.I. = 1.89 - 1.84 = 0.05 bit    (13)
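The three worked numbers can be checked with a short calculation (a sketch under the stated assumptions of equiprobable classes and statistical independence; not from the paper).

```python
from math import log2

def H(p):
    # Binary entropy in bits (Equation 6).
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

# P(x_i = 1 | c_j) for the two intervals of the example (Table III).
p_given_c = {1: [1/4, 1/3], 2: [1/2, 3/8]}
P_c = {1: 0.5, 2: 0.5}                       # equiprobable classes

# Class-averaged peak probabilities (last column of Table III).
p_avg = [sum(P_c[c] * p_given_c[c][i] for c in (1, 2)) for i in range(2)]

H_x = sum(H(p) for p in p_avg)                                         # Eq 7-8
H_x_c = sum(P_c[c] * sum(H(p) for p in p_given_c[c]) for c in (1, 2))  # Eq 9-11
print(round(H_x, 2), round(H_x_c, 2), round(H_x - H_x_c, 2))  # 1.89 1.84 0.05
```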

Then only 0.05 bit of information is available to distinguish among the members of the two classes. This is the basis for the expected parallel between recognition ability and mutual information.

RESULTS

In the example problem, the mutual information is 0.05 bit and the recognition ability is still unknown. This recognition may be estimated if the independence assumption is true. In this case, the probability for each possible two-dimensional spectrum for each class is collected in Table IV. Given a maximum likelihood decision, if an unknown spectrum is (1,1), it will be placed into class 2. However, this forces the 1/12 of all class 1 spectra which are represented by (1,1) to be incorrectly classified. Since the classes are assumed to be equiprobable, 1/24 of all possible spectra are misclassified if the spectrum is (1,1). Similarly, 1/6 of the class 1 spectra (or 1/12 of all the spectra) are misclassified if the spectrum is (1,0). The total proportion misclassified then is

error rate = 1/24 + 1/12 + 3/32 + 5/32 = 3/8

Then three out of every eight spectra are misclassified given this set of probabilities, and the expected recognition is 62.5% for this problem. The relationship between M.I. and recognition is first tested by randomly generating a set of tables similar to Table II. Once the values of P(x1 = 1 | c1), P(x2 = 1 | c1), P(x1 = 1 | c2), and P(x2 = 1 | c2) have been generated, the M.I. and expected recognition are calculated as shown above. A plot of M.I. vs. recognition is shown in Figure 1 for 100 randomly drawn probability sets. In Figure 2, the recognition is plotted vs. the square root of the M.I. This plot appears linear and has a linear correlation of 0.98. The relationship between recognition and mutual information appears unmistakable for the artificial problem. Difficulties with the probabilities not strictly obeying the independence assumption are investigated in the infrared application. The seventy-eight pairwise questions generated from the thirteen classes of infrared spectra are used as an empirical test of the relationship between M.I. and recognition.
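The expected-recognition calculation and the random-table experiment can both be sketched in a few lines (a hypothetical illustration of the procedure described above, not the authors' program; equiprobable classes and two spectral intervals are assumed).

```python
import random
from math import log2

def H(p):
    return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)

def recognition_and_mi(p1, p2):
    # p1, p2 hold P(x_i = 1 | c_1) and P(x_i = 1 | c_2) for the two intervals.
    correct = 0.0
    for d1 in (0, 1):
        for d2 in (0, 1):
            like1 = (p1[0] if d1 else 1 - p1[0]) * (p1[1] if d2 else 1 - p1[1])
            like2 = (p2[0] if d1 else 1 - p2[0]) * (p2[1] if d2 else 1 - p2[1])
            # Each class holds half the spectra; the larger likelihood is the
            # fraction classified correctly for this spectrum pattern.
            correct += 0.5 * max(like1, like2)
    mi = sum(H(0.5 * a + 0.5 * b) - 0.5 * H(a) - 0.5 * H(b) for a, b in zip(p1, p2))
    return correct, mi

# The worked example: recognition 0.625 (62.5%) and M.I. of about 0.05 bit.
print(recognition_and_mi([1/4, 1/3], [1/2, 3/8]))

# 100 randomly drawn probability tables, as in Figures 1 and 2; plotting
# recognition against the square root of mi should look roughly linear.
pairs = [recognition_and_mi([random.random() for _ in range(2)],
                            [random.random() for _ in range(2)]) for _ in range(100)]
```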


Figure 1. Plot of recognition vs. mutual information for a set of randomly selected two-dimensional classifiers

Figure 2. Plot of the data in Figure 1, recognition vs. square root of mutual information

Table VA. Recognition Results for Two Class Problems (in %)

Class     1     2     3     4     5     6     7     8     9    10    11    12
  2    94.8
  3    89.5  86.5
  4    96.5  99.8  96.2
  5    92.8  95.2  89.0  98.8
  6    97.2  99.0  95.8  93.2  97.2
  7    95.8  95.8  93.8  87.8  93.5  88.0
  8    95.0  93.5  90.0  90.2  92.2  93.0  84.5
  9    90.0  93.0  85.8  93.5  92.5  86.2  88.2  89.5
 10    91.8  97.0  92.0  95.0  92.0  89.0  92.0  93.2  79.5
 11    97.2  91.8  89.0  87.0  91.5  95.5  89.0  83.8  90.5  94.5
 12    94.0  96.8  92.8  95.0  95.2  93.0  90.5  94.2  92.8  91.0  93.8
 13    97.2  97.0  95.2  95.5  96.2  97.2  94.5  94.0  96.0  94.0  97.0  91.7

Table VB. Mutual Information for Two Class Problems (in bits)

Class     1     2     3     4     5     6     7     8     9    10    11    12
  2    2.17
  3    1.11  1.01
  4    2.55  1.71  1.48
  5    1.67  1.63  1.28  2.05
  6    2.57  2.61  1.89  1.38  2.36
  7    1.98  2.13  1.40  1.15  1.86  0.98
  8    1.96  1.61  1.32  1.37  1.55  1.74  0.72
  9    1.45  2.24  1.22  1.81  1.95  1.14  1.20  1.65
 10    1.78  2.56  1.55  2.22  1.85  1.25  1.60  2.10  0.74
 11    2.00  1.34  1.07  1.19  1.86  1.55  0.91  0.80  1.52  2.30
 12    2.10  2.52  1.88  2.40  2.44  1.55  1.38  1.65  1.37  1.74  1.47
 13    1.89  2.29  1.74  2.08  1.79  1.97  1.73  2.00  1.82  1.82  1.91  1.61

In the training step, the data set is used to estimate the class conditional probabilities. The members of the training set are then classified in the recognition step. The recognition results for the 78 questions have been computed and are tabulated in Table VA. The entry in row i and column j indicates the percentage of (the 400) spectra in classes i and j that are correctly classified. The recognition results lie in the range 79.5-99.8%, which indicates that the classification algorithm is quite successful. Table VB presents the amount of M.I. (in bits) calculated for each question. The general format of the table is the same as in Table VA.


Again, if the M.I. is related to the recognition ability, a high correlation is expected between the corresponding elements of the two tables. The linear correlation between the recognition and the square root of M.I. is calculated to be 0.83. For random, unrelated variables the probability of a correlation this high is