Information Theory and Its Application to Analytical Chemistry

D. E. Clegg, Griffith University, Nathan, Queensland 4111, Australia
D. L. Massart, Free University Brussels, Brussels 1090, Belgium

The Changing Demands on the Analyst

The professional mission of analytical chemists is to supply clients with information about the composition of samples. The means for gaining this information has changed profoundly over the years, from the common use of "wet" chemical methods to the almost complete reliance on physical instrumentation that is found in contemporary laboratories. Such changes have been made largely in response to the more stringent demands made by clients, for example, requests for the determination of very low levels of analyte or for fast multielement analysis. Inevitably, the shift to instrumental methods has also required analysts to augment their training in chemistry with an understanding of other fields such as electronics, optics, electromagnetism, and computing. However, the need to develop knowledge and skills in these diverse disciplines should not be allowed to obscure the distinctive task of the analyst: to supply chemical information about samples. The emphasis should be kept on gathering reliable information, rather than being shifted to any particular technique or methodology.

sages", where the term message is used in its widest sense. This range includes the faint molecular messages that emanate from the analyte in a sample, then pass through the electronic encoding and decoding processes of a spe&ometer.

~

The Evolving Concept of Information

It would be useful and timely for analysts to reconsider in greater depth what is meant by "information" in general and "chemical information" in particular. Then their complex and rapidly evolving methodologies could be put in broader teleological perspective. Such matters have been the subject of much study since a theory of information was developed by Claude Shannon in the late 1940's to solve some problems in message communication over the primitive Morse code transmitters then in use (1). The theory has been applied to a varied range of situations that involve the transmission of "messages", where the term message is used in its widest sense. This range includes the faint molecular messages that emanate from the analyte in a sample, then pass through the electronic encoding and decoding processes of a spectrometer.

Evaluating and Comparing Procedures

The reliability of information passing through such a system depends on both the ability of the sender (the analyst) and the ability of the transmitter. The sender must correctly prepare the sample for encoding. The transmitter must convey the desired signal to the output stage. Also, this signal must be kept separate from the myriad of other signals emanating from the sample extract and from within the instrumentation. Shannon's singular contribution was to recognize that it was possible to describe and compare the performance of various message-transmission systems in a much more meaningful way. He defined the term "information" to allow the amount of information passing through a system to be quantified. Workers in the relatively new field of "chemometrics" recognized the application of his theory to analysis. It could provide a measure of the performance of an analytical procedure, expressed in terms of a common currency, namely the yield of "information". Widely different methodologies could thus be evaluated and compared by means of this single yardstick (2). Below we discuss how information is defined and how the basic ideas of information theory can be applied to analytical chemistry, from spot tests and chromatography to the interpretation of large chemical databases. For pedagogical reasons the examples are taken from qualitative analysis, but the concepts are equally applicable to quantitative analysis.


Quantifying Information

Information is required to reduce the uncertainty that arises when several options exist and no one knows which is correct. As the number of options increases, so does the uncertainty and the amount of information needed to reduce or completely resolve it. Thus, in a sense, information and uncertainty are complementary: each is related quantitatively to the number of options that exist. Below we use an example from a simple laboratory test to show how this leads to a useful quantitative definition of information (and uncertainty).


The Simplest Qualitative Spot Test

Consider the use of a qualitative spot test to determine the presence of iron in a water sample. In terms of information, there are only two possible options: iron either is or is not present above the detection limit for the test. (Complications that might arise near the detection limit will be ignored.) Without any sample history, the testing analyst must begin by assuming that these two outcomes are equiprobable. This situation is summarized as follows.

    Outcome             Probability
    0 (Fe is absent)    1/2
    1 (Fe is present)   1/2

This describes the most basic level of uncertainty that is possible: the choice between two equally likely cases. Such a situation is assigned a unit value of uncertainty. The value of the information contained in the outcome of this analysis is also equal to 1 if the uncertainty is completely removed. This unit of information is called the "bit", as used in binary-coded systems. (Clearly, a situation with only one possibility has no uncertainty, and thus can yield no information.) Only one test, which must have two distinct observable states, is needed to complete this analysis. For example, after oxidation, the addition of ferrocyanide reagent either will or will not produce the characteristic color of Prussian blue. Thus, an analytical task with 1 unit of uncertainty can be completely resolved by a test that can provide 1 unit of information.

Increasing the Possibilities

When up to two metals may be present in the sample solution (e.g., Fe or Ni or both), there are four possible outcomes, ranging from neither metal being present to both being present.

    Outcome              Probability
    neither Fe nor Ni    1/4
    Fe only              1/4
    Ni only              1/4
    Fe and Ni            1/4

Which of these four possibilities turns up can be determined using two tests, each having two observable states. Similarly, with three elements there are eight possibilities, each with a probability of 1/8 (i.e., 1/2^3). Three tests are then needed to resolve the question.

Computing Information and Uncertainty

The following pattern clearly relates the uncertainty and the information needed to resolve it. The number of possibilities is expressed in powers of 2.

    No. of Possible Outcomes (n)    No. of Tests Needed
    1 = 2^0                         0
    2 = 2^1                         1
    4 = 2^2                         2
    8 = 2^3                         3

The power to which 2 must be raised to give the number of possibilities n is defined as the log to base 2 of that number. Thus, information and uncertainty can be defined quantitatively in terms of the log to base 2 of the number of possible analytical outcomes:

    I = H = log2 n    (eq 1)

where I indicates the amount of information and H indicates the amount of uncertainty. The initial uncertainty can also be defined in terms of the probability of the occurrence of each outcome. For example, by referring to the probabilities in the tables above, the following definition can be written:

    I = H = log2 n = -log2 p    (eq 2)

where I is the information contained in the answer, given that there were n possibilities; H is the initial uncertainty resulting from the need to consider the n possibilities; and p = 1/n is the probability of each outcome if all n possibilities are equiprobable.

Nonequal Probabilities

The expression can be generalized to the situation in which the probability of each outcome is not the same. If we know from past experience that some elements are more likely to be present than others, eq 2 is adjusted so that the logarithms of the individual probabilities, suitably weighted, are summed:

    H = -Σ pi log2 pi    (eq 3)

where pi is the probability of the ith outcome. Thus, we can consider again the original example, except that now past experience has shown that 90% of the samples contained no iron. This situation is summarized as follows.

    Outcome      Probability
    0 (no Fe)    0.9
    1 (Fe)       0.1

The degree of uncertainty is calculated using eq 3 as

    H = -(0.9 log2 0.9 + 0.1 log2 0.1) bits = 0.469 bits

The amount of uncertainty in this case is substantially less than that in the original equiprobable case (1 bit). In common-sense terms, if we know beforehand that 90% of the samples contain no iron, our uncertainty about the outcome is not as great. Thus, the value of the information in our report will not be as great either. In general, the greatest uncertainty is associated with the most even distribution of prior probabilities. Consequently, the information content from reporting the outcome of such a situation is also at a maximum.
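These values are easy to verify numerically. The following is a minimal Python sketch, added here purely for illustration (the function name is our own), that computes eq 1 and eq 3 for the cases discussed above.

    import math

    def uncertainty(probs):
        # Shannon uncertainty H = -sum(p * log2 p), in bits (eq 3)
        return -sum(p * math.log2(p) for p in probs if p > 0)

    # eq 1: n equiprobable outcomes give H = log2(n)
    print(uncertainty([1/2, 1/2]))            # 1.0 bit (Fe absent or present)
    print(uncertainty([1/4] * 4))             # 2.0 bits (Fe and/or Ni: four outcomes)
    print(uncertainty([1/8] * 8))             # 3.0 bits (three elements)

    # Nonequal probabilities: 90% of past samples contained no iron
    print(round(uncertainty([0.9, 0.1]), 3))  # 0.469 bit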

Conversely, when past analyses show that certain outcomes are much more likely than others, the level of information in the report is correspondingly lower.

Comparison of Analytical Systems

The quantitative nature of the definitions for information and uncertainty given previously provides a useful method to assess and compare different analytical methods. Methods can be compared in terms of their ability to generate information that reduces uncertainty about the composition of samples. A number of papers have described this comparison for specific cases (3-6). Below we use thin-layer chromatography (TLC) as a conceptually simple analytical situation that illustrates how the definition is used to do this.

Calculations for a TLC Test of Many Species

Consider the use of TLC to identify drugs that are used therapeutically or in abuse. Suppose the compound to be identified is known to be one of a library of n0 compounds in which the probability of occurrence is the same for each. Thus, the a priori probability for each is

    p0 = 1/n0

Then, by definition, the uncertainty before TLC analysis for any one of the n0 substances will be described by the following equation.

    H0 = log2 n0

Suppose that a spot is found at position Rfi after developing the TLC plate. It is known that any one of ni of the n0 substances can be found here. Then the probability has been increased to

    pi = 1/ni

Thus, the uncertainty about identification is reduced to

    Hi = log2 ni

The difference in these uncertainties is the information derived from the analytical procedure. Thus,

    Ii = H0 - Hi = log2 n0 - log2 ni = log2 (n0/ni)

To take a numerical example, consider a screening test for 20 possible substances that are all equally likely to occur, in which two have an Rf between 0.19 and 0.25. (Due to the precision of this TLC method, substances with Rf values inside the range (0.19-0.25) cannot be distinguished from one another.) The following equation gives the information obtained from the procedure when a spot is observed with an Rf in the range (0.19-0.25).

    I+ = log2 (20/2) = 3.32 bit

Then the information obtained when no spot is observed with an Rf in the range (0.19-0.25) is given below.

    I- = log2 (20/18) = 0.15 bit

The two observations are not equiprobable. The first result will be found only 2 times out of 20; the second will be found 18 times. Thus, to obtain the average information obtained from the observation, we must calculate a weighted mean.

    I = (2/20) × 3.32 + (18/20) × 0.15 = 0.47 bit

This can be written as

    I = -(p+ log2 p+) - (p- log2 p-)

where p+ is the probability of finding an Rf between 0.19 and 0.25; and p- is the probability of finding another value. Consider not just two categories of results but n categories. For example, consider an Rf in the range (0-0.05) or (0.05-0.10), etc. Then

    I = -Σ p(xi) log2 p(xi)

where n is the number of categories; and xi is the ith category.

Evaluating Different TLC Systems

The following example shows how information theory can be used to compare the "quality" of two such chromatographic methods. TLC system 1 permits the separation of 20 substances into five classes of four. TLC system 2 permits their separation into one class of 10, one class of seven, and three classes of one. Which is the best system? Using the expression above, system 1 (each class has probability 4/20) yields

    I1 = -5 × (4/20) log2 (4/20) = log2 5 = 2.32 bit

while system 2 yields

    I2 = -(10/20) log2 (10/20) - (7/20) log2 (7/20) - 3 × (1/20) log2 (1/20) = 1.68 bit

System 1 permits the largest reduction in uncertainty. In other words, system 1 can provide, on average, more analytical information. However, for the three substances that are uniquely classified in system 2, this system would be better. This example illustrates again that the greatest uncertainty is generally associated with the most even distribution of prior probabilities. Consequently, the information content from reporting the outcome of such a situation is also at a maximum.
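The same arithmetic can be scripted. In the Python sketch below (our own illustration; the class sizes are taken from the examples above), a separation is described by the sizes of the groups of substances it cannot resolve.

    import math

    def information(class_sizes, n_total):
        # Average information (bits) from a separation that groups n_total
        # equiprobable substances into indistinguishable classes of these sizes
        return -sum((k / n_total) * math.log2(k / n_total) for k in class_sizes)

    # Screening test: 2 of 20 substances fall inside the Rf window (0.19-0.25)
    print(round(information([2, 18], 20), 2))           # 0.47 bit (weighted mean)

    # TLC system 1: five classes of four substances each
    print(round(information([4] * 5, 20), 2))           # 2.32 bits

    # TLC system 2: classes of 10, 7, 1, 1, and 1
    print(round(information([10, 7, 1, 1, 1], 20), 2))  # 1.68 bits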

System Optimization

Besides using information theory to compare the quality of chromatographic systems, we can use it for method development to optimize such systems. In high-performance liquid chromatography (HPLC), the composition of the mobile phase must be optimized. This is frequently done with experimental design methods. For any formal optimization method, a criterion is needed. Usually, it is some function of chromatographic resolution. However, as shown by Siouffi (7), it is also possible to use the information content of a chromatogram, computed as above, as the criterion.

Combining Qualitative Analysis Systems

Because chromatography is a poor identifier, analysts have sought to reduce the risk of misidentification by using a combination of chromatographic separations carried out on the same sample mixture. How such procedures increase information can also best be illustrated with TLC.


Increasing the Informing Power

Consider the problem of separating eight substances by TLC. If the analytical task is to identify which one is present in a sample, then the system must assign a characteristic Rf value to each of the eight possible substances. If it fails to separate two or more of them, a second or even third TLC system must be added until enough "informing power" is acquired to completely resolve the uncertainty. The table lists the (hypothetical) Rf values of the eight substances in four different solvents in a TLC system that can distinguish spots measured at 0.1 to 0.8, in intervals of 0.1. Solvent 1 represents the ideal situation in which each substance has a different Rf value. In terms of information, the sample has an initial uncertainty of log2 8 = 3 bits because there are eight possibilities. The maximum information that the TLC system can deliver is also log2 8 = 3 bits because it can produce eight different "signals", as seen with solvent 1.


Combining Solvents

In reality, this parity between sample uncertainty and system information does not happen very often. We are more likely to encounter the situation shown by solvent 2, which gives us some information but not enough to guarantee identification. It fails to distinguish between the pairs AB, CD, EF, and GH. It delivers log2 4 = 2 bits of information, and we need 3 bits. Solvent 3 can only separate the substances into two groups (1 bit). However, when solvent 3 and solvent 2 are combined, all eight substances can be identified separately. Solvent 4 also gives only two spots, but combining solvents 4 and 2 will not yield identification of all substances.

Correlations

When two systems are combined, simple addition of the information is possible only when the systems are uncorrelated. That is, for systems A and B,

    I(A+B) = I(A) + I(B)    (only if ρAB = 0)

where ρAB is the correlation coefficient between A and B. Correlation always reduces the signal "space" that is available to the substances present. Thus, in the above case, a combination of two solvents produces a two-dimensional space with a potential array of 8 × 8 = 64 distinct signals, which is equivalent to a maximum of 6 bits of information. Thus, the system can ideally recognize 64 different Rf combinations.

Rf Values

    Substance    Solvent 1    Solvent 2    Solvent 3    Solvent 4
    A            0.10         0.20         0.20         0.20
    B            0.20         0.20         0.40         0.20
    C            0.30         0.40         0.20         0.20
    D            0.40         0.40         0.40         0.20
    E            0.50         0.60         0.20         0.40
    F            0.60         0.60         0.40         0.40
    G            0.70         0.80         0.20         0.40
    H            0.80         0.80         0.40         0.40

    Information (bits):  3    2    1    1


Figure: Uncorrelated and correlated Rf values and the effect on signal "space".

If correlation occurs, the actual combinations become restricted to a limited zone that becomes narrower as ρAB approaches 1. In other words, the chance that two substances have the same analytical "signal" increases as this zone narrows, thus reducing the information-generating capabilities of the combined system. The example described above is illustrated diagrammatically in the figure.
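The effect of combining solvents can be checked directly from the table. The Python sketch below (an added illustration using the hypothetical Rf values above) treats each substance's pair of Rf values as its combined "signal" and computes the information each combination delivers.

    import math

    # Hypothetical Rf values from the table above, one list per solvent (A-H)
    rf = {
        1: [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80],
        2: [0.20, 0.20, 0.40, 0.40, 0.60, 0.60, 0.80, 0.80],
        3: [0.20, 0.40, 0.20, 0.40, 0.20, 0.40, 0.20, 0.40],
        4: [0.20, 0.20, 0.20, 0.20, 0.40, 0.40, 0.40, 0.40],
    }

    def info_bits(signals):
        # Information (bits) when 8 equiprobable substances emit these signals
        n = len(signals)
        return -sum((signals.count(s) / n) * math.log2(signals.count(s) / n)
                    for s in set(signals))

    print(info_bits(rf[2]))                    # 2.0 bits: pairs AB, CD, EF, GH merge
    print(info_bits(list(zip(rf[2], rf[3]))))  # 3.0 bits: all eight resolved
    print(info_bits(list(zip(rf[2], rf[4]))))  # 2.0 bits: solvents 2 and 4 correlate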

Classification Using Expert Systems

The application of information theory for so-called machine learning was proposed by Quinlan (8). Machines can "learn" by so-called inductive reasoning, and inductive expert systems are now available to "teach" them. Before explaining how information theory plays a role in such expert systems, it may be necessary to clarify the difference between deductive and inductive expert systems.

Deductive Systems

Deductive expert systems are more usual and thus better known. Their knowledge base consists of rules that have been entered by experts. The system uses these rules by chaining them to reach a conclusion. For example, we can reach the conclusion below by combining the following two rules.

    If a substance contains more than 20 carbon atoms, then it should be considered apolar.
    If a substance is apolar, then it should be soluble in methanol.

Thus, substance X, with 23 carbon atoms, should be soluble in methanol.
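Rule chaining of this kind is simple to mechanize. The fragment below is a minimal forward-chaining sketch in Python (the fact and rule labels are our own; no particular expert-system shell is implied).

    # Facts are labels; each rule maps one premise to one conclusion.
    rules = [
        ("more_than_20_carbons", "apolar"),
        ("apolar", "soluble_in_methanol"),
    ]
    facts = {"more_than_20_carbons"}  # substance X, with 23 carbon atoms

    # Apply rules repeatedly until no new fact can be derived
    changed = True
    while changed:
        changed = False
        for premise, conclusion in rules:
            if premise in facts and conclusion not in facts:
                facts.add(conclusion)
                changed = True

    print("soluble_in_methanol" in facts)  # True: the chained conclusion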

Inductive Systems

Inductive expert systems work from examples. An inductive expert system that was designed to classify the solubilities of organic substances would comprise the following:

    a set of substances
    a selection of their properties, such as carbon number, functional group, etc.
    a set of solvents known to dissolve the substances

When a new substance is presented to the system, the system would give advice based on rules or analogies that it has derived from these examples. In other words, while deductive systems need rules given by experts, inductive systems make rules that are based on the examples supplied. Quinlan's ID3 algorithm, which is based on information theory, is the best-known algorithm for such inductive learning.


Using Information Theory with Inductive Systems

The use of information theory for inductive expert systems can be shown in the use of chromatographic analysis for food authentication. Consider, for example, the problem of determining the origin of a particular unknown olive oil sample. The determination could be based on the fatty acid composition of known olive oil samples that originated in two growing regions, East and West Liguria in Italy (9).


Choosing the Rules

In the inductive learning process a set of rules is developed from the patterns of fatty acid levels in each of the oils. Thus, "information" is the criterion for choosing the rules that have the greatest power of discrimination. Suppose there are eight oil samples from East (E) and West (W) Liguria. Then the task becomes determining the origin of a sample based on the percentage of one or more fatty acids. In deciding whether a sample is a W or an E, the required information (or the initial uncertainty before testing) will be the following.

    Hb = -(p(W) log2 p(W)) - (p(E) log2 p(E))

If it is known that there are four unknowns from each category, then the a priori probability is 0.5 for each category. Using eq 1, we conclude that Hb = 1. For other a priori probabilities, other Hb values are obtained. Thus, we get the following table.

    Number     Prob. (W, E)      Uncertainty
    4 W 4 E    (0.5, 0.5)        Hb = 1 bit
    3 W 5 E    (0.375, 0.625)    Hb = 0.954 bit
    2 W 6 E    (0.25, 0.75)      Hb = 0.81 bit
    0 W 8 E    (0, 1)            Hb = 0 bit

These numbers can be understood, for example, if we know that the situation 0 W 8 E does indeed require no information. We know that all samples are E. Thus, we do not require information to determine the origin of one of these samples. The situation 4 W 4 E (p = 0.5 for each category) is the most uncertain. We require more bits than in the other situations in which the a priori knowledge is greater.

The Effect of Test Results

Let us now go a step further and see how the required information (or initial uncertainty) is affected by a particular test result. For test 1, let us assume the worst situation, in which there is a 50:50 chance that the oil is from E or W. Thus,

    p = 0.5 and H = 1

where H is the initial uncertainty, the maximum possible in this case. Let us also assume that the test requires determining whether or not some property of the oil exceeds a threshold value. For example, say we want to determine whether or not the oleic acid percentage exceeds 15. If so, the test is positive (+). If not, the test is negative (-). Finally, in order to carry out the following calculations, we assume that we already know the following about the samples: The test shows that five of the eight samples are above 15%, and thus positive. These five positive samples are 3 E and 2 W. The remaining three samples test negative; they are 1 E and 2 W. The question then arises: "How much information would be obtained, on average, when an analyst applies this test to a sample from this batch of eight?" If the result is >15%, then the following would be the information still required after the test to resolve the remaining uncertainty.

    H+ = -(3/5) log2 (3/5) - (2/5) log2 (2/5) = 0.971 bit

If the result is less than 15%, it would be

    H- = -(1/3) log2 (1/3) - (2/3) log2 (2/3) = 0.918 bit

Thus, the weighted average residual uncertainty covering all possible outcomes of the test would be

    H = (5/8) × 0.971 + (3/8) × 0.918 = 0.951 bit

because we know the distribution of oils: five samples above the threshold and three below. Thus, the information generated by test 1 is the difference between the initial and final amounts of required information (or uncertainties).

    I = 1 - 0.951 = 0.049 bit

This, of course, is very little. A test with this threshold value is not very useful because it does not yield an appreciable amount of information.

An Improved Test

Then let us consider another test in which there are again five positive and three negative samples. This time all the negative samples are E, and the positive results are from 4 W and 1 E samples. The information still required after the test is shown below. If the test result is positive, then calculating as above,

    H+ = -(4/5) log2 (4/5) - (1/5) log2 (1/5) = 0.72 bit

If it is negative, we get

    H- = 0 bit

because a negative test means that the sample can only be an E sample. Thus, the average residual uncertainty after performing test 2 is

    H = (5/8) × 0.72 + (3/8) × 0 = 0.45 bit

Thus, the information obtained from the test is

    I = 1 - 0.45 = 0.55 bit

Clearly this test is better than the first one.
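This selection criterion, the information gain of a candidate test, is the heart of ID3 and is easy to compute. The Python sketch below (our own illustration of the calculation, not Quinlan's code) reproduces the figures for tests 1 and 2.

    import math

    def h(counts):
        # Uncertainty (bits) of a class distribution, e.g. {"E": 4, "W": 4}
        n = sum(counts.values())
        return -sum((k / n) * math.log2(k / n) for k in counts.values() if k)

    def gain(pos, neg):
        # Information yielded by a binary test that splits the batch of eight
        # samples (4 E, 4 W) into the given positive and negative class counts
        n = sum(pos.values()) + sum(neg.values())
        before = h({"E": 4, "W": 4})  # 1 bit of initial uncertainty
        after = (sum(pos.values()) / n) * h(pos) + (sum(neg.values()) / n) * h(neg)
        return before - after

    # Test 1: oleic acid > 15% gives positives (3 E, 2 W), negatives (1 E, 2 W)
    print(round(gain({"E": 3, "W": 2}, {"E": 1, "W": 2}), 3))  # 0.049 bit

    # Test 2: positives (1 E, 4 W); all three negatives are E
    print(round(gain({"E": 1, "W": 4}, {"E": 3, "W": 0}), 2))  # 0.55 bit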

Quinlan's Algorithm

Quinlan's algorithm is sequential. The first variable it selects is the variable that yields the most information. For continuous variables, the algorithm must find the threshold at which that variable is most informative. This threshold is then used to split the sample in two. If the second test were found to be the best, then that test would be selected, as done in the previous example of testing for E and W. Consequently, the following rule would be created.

    IF test 2 = (-)    THEN the sample = E

A positive test in this example is not conclusive, and an additional test would be selected to further divide the group in two, until each remaining group comprises a single category. In this simple example the five remaining unseparated samples are 4 W and 1 E. The resulting complete set of rules might appear as below.

    IF test 2 = (-)                          THEN sample = E (3)
    IF test 2 = (+) and IF test 6 = blue     THEN sample = E (1)
    IF test 2 = (+) and IF test 6 = red      THEN sample = W (4)

Test 6 is the best of several tests investigated for separating the 4 W and 1 E. Of course, the example given above is only an example to explain the methodology.

Computer Programs and Automation

I n practical situations, one needs computer programs. For example, EX-TRAN was used to fmd decision rules to separate E and W oils that were characterized by their fatty acid content. Seven fatty acids were used (9). The rules developed by the program take the following form. If the linolenic acid content is found ta be less than 25 and the linolie acid is less than 665.0, then sample is E.



Conclusion

Information theory has made a special contribution to analytical chemistry: It provides a fundamental criterion for assessing the progress made after analysis in decreasing the initial uncertainty surrounding a sample. This could mean an increase in qualitative information about sample composition or a better understanding of its classification among a group. From the philosophical point of view, this criterion has the virtue of being equally applicable to quantitative and qualitative spheres. Protocols in quantitative analysis already include familiar and well-proven statistical parameters. However, this is not so for qualitative analysis, in which concepts such as specificity and selectivity still tend to be vague and poorly understood. Also, in qualitative analysis the analyst must decide how to quantify the merits of alternatives such as the TLC systems or color tests in our simple examples. In dealing with qualitative uncertainty, the lack of an accessible criterion of performance often leads to analytical "overkill" to cover any risk of error in identification. Information theory offers analysts a way of moving more easily within this qualitative dimension of their work. It gives them a rational basis for more precisely and economically matching the analytical needs of the client to the methodology.

Literature Cited