Application of Information Theory to Analytical Chemistry. Identification by Retrieval of Gas Chromatographic Retention Indices Foppe Dupuis and Auke Dijkstra Analytisch Chemisch Laboratorium, State University of Utrecht, Croesestraat 77A, Utrecht, The Netherlands
By using Shannon’s formula, amounts of information have been calculated for 10 stationary phases used in GLC. The data sets used were taken from McReynolds. The amount of information per column was 6.5-7.0 bits, depending slightly on the stationary phase. Taking into account correlations between the retention indices, the information for the 10 columns together amounted to 43.3 bits. The amount of information as a function of the number of columns used depends on the sequence of the columns chosen. A procedure was developed for sequencing the columns to obtain the maximum increase of information when adding a column. The optimal sequence of the first four columns covers one apolar and three (rather) polar stationary phases.
In order to solve an analytical problem, information has to be produced. To gather the required (amount of) information, analytical methods are used. Different analytical methods yield different amounts of information. Knowledge of these amounts might facilitate the choice of the most suitable method for the problem to be solved. Information theory gives the tools to assess these amounts. Several authors have applied the principles of information theory to analytical chemistry. Kaiser (1) related the informing power of an analytical method to the signal-to-noise ratio and to the resolution. He discussed the information provided and required. Eckschlager (2-5) and Doerffel and Hildebrandt (6) also contributed to this discussion. Griepink and G. Dijkstra (7) applied the principles of information theory to elemental analysis and Grotch (8) calculated the information content (amount of information) of mass spectra. Huber and Smit (9) and Palm (10) discussed some aspects of the information obtained from gas chromatograms in connection with the data handling. Massart (11) used the information as a parameter in choosing the best mobile phase in thin layer chromatography. The same author (12) used the informing power as a quality criterion in chromatography. The study presented here deals with the application of information theory to gas-liquid chromatography, in particular with the identification of unknown compounds by matching the retention indices with values compiled in a library (identification by retrieval).
AMOUNT OF INFORMATION Producing information can be considered equivalent to the reduction of the uncertainty with respect to the composition or identity of the sample to be analyzed. Before the sample is analyzed, usually some preinformation is available. In quantitative analysis, the percentage of the compound to be analyzed is always between 0 and 100%. In qualitative analysis, the compound to be identified always belongs to a smaller or larger group of compounds. After analysis, there usually remains an uncertainty due to experimental errors or to imperfections in the procedure of translating the measurements into the analytical result.
In gas-liquid chromatography, the identity of a compound can be established from one or more retention indices by using a retrieval procedure. Such an identification yields an amount of information that depends on the group of compounds and the experimental error. Calculation of the amount of information will be possible if the uncertainties before and after analysis can be expressed in a quantitative way. In the case of an unknown compound belonging to a class of N compounds and establishing the identity with 100% certainty, the information obtained equals
$$I = \operatorname{ld} N \quad \text{(bit)} \tag{1}$$
where ld stands for the dual (base-2) logarithm. If a compound i is likely to be found with a probability $p_i$, Equation 1 has to be replaced by
$$I = -\sum_{i=1}^{N} p_i \operatorname{ld} p_i \tag{2}$$
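As a minimal sketch of Equations 1 and 2 (the function names are illustrative and do not come from the paper), the two quantities can be computed as follows. The probabilities in the example calls are hypothetical.

```python
import math


def information_equiprobable(n_compounds):
    """Equation 1: bits gained by identifying one of N equally likely compounds."""
    return math.log2(n_compounds)


def information_shannon(probabilities):
    """Equation 2: bits gained when compound i occurs with probability p_i."""
    if abs(sum(probabilities) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Terms with p_i = 0 contribute nothing (limit of p ld p as p -> 0).
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


print(information_equiprobable(1024))          # 10.0
print(information_shannon([0.25] * 4))         # 2.0, same as ld 4
print(information_shannon([0.5, 0.25, 0.25]))  # 1.5, less than ld 3
```

For equal probabilities Equation 2 reduces to Equation 1; any uneven distribution over the same compounds yields less information.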
In case of identification by retrieval, the uncertainty before analysis can be expressed in terms of the distribution function $P_m$ of the indices of the group of compounds, provided that all these compounds are equally likely to be found. The uncertainty remaining is directly related to the error distribution function $P_e$. Using Shannon’s equation (e.g., 13) in integral form, the information obtained equals
$$I(1) = -\int_{R_1}^{R_2} P_m(x) \operatorname{ld} P_m(x)\,dx + \int_{R_1}^{R_2} P_e(x) \operatorname{ld} P_e(x)\,dx \tag{3}$$
with
$$\int_{R_1}^{R_2} P(x)\,dx = 1$$
and where x represents the retention index. The integration boundaries $R_1$ and $R_2$ represent the lowest and highest retention indices in the library. $I(1)$ is the information obtained from the index measured on one stationary phase (column). If the distributions $P_m$ and $P_e$ are normal (Gaussian), Equation 3 reduces to
$$I_G(1) = \tfrac{1}{2} \operatorname{ld} \frac{\sigma_m^2}{\sigma_e^2} \tag{4}$$
where $\sigma_m^2$ and $\sigma_e^2$ are the variances of the functions $P_m$ and $P_e$, respectively. The index G indicates that the distribution function used in the calculation of the information is normal (Gaussian). Since $\sigma_m^2$ is composed of the variance $\sigma_r^2$ of the true index values with distribution $P_r$ and the variance $\sigma_e^2$ of the error distribution function $P_e$, Equation 4 can be written as
$$I_G(1) = \tfrac{1}{2} \operatorname{ld} \left(1 + \sigma_r^2/\sigma_e^2\right) \tag{5}$$
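The Gaussian case can be checked numerically. The sketch below assumes that the single-column result has the form $\tfrac{1}{2}\operatorname{ld}(\sigma_m^2/\sigma_e^2)$, which is what Equation 5 implies when $\sigma_m^2 = \sigma_r^2 + \sigma_e^2$. The function names are illustrative; the values 256 and 2 index units are the spread of column 8 and the error standard deviation quoted in the Results.

```python
import math


def gaussian_information(sigma_m, sigma_e):
    """Information (bits) from the measured spread sigma_m and error spread sigma_e."""
    return 0.5 * math.log2(sigma_m ** 2 / sigma_e ** 2)


def gaussian_information_true(sigma_r, sigma_e):
    """Same quantity written with the spread sigma_r of the true index values."""
    return 0.5 * math.log2(1.0 + sigma_r ** 2 / sigma_e ** 2)


sigma_e = 2.0    # error standard deviation, index units
sigma_m = 256.0  # standard deviation of the measured indices, index units

i_measured = gaussian_information(sigma_m, sigma_e)
print(round(i_measured, 2))  # 7.0

# Both forms agree once sigma_m^2 = sigma_r^2 + sigma_e^2 is used.
sigma_r = math.sqrt(sigma_m ** 2 - sigma_e ** 2)
print(abs(gaussian_information_true(sigma_r, sigma_e) - i_measured) < 1e-9)  # True

# Halving the error standard deviation adds one bit.
gain = gaussian_information(sigma_m, sigma_e / 2) - i_measured
print(round(gain, 2))  # 1.0
```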
Usually $\sigma_e$ is small as compared to $\sigma_r$ and thus $\sigma_m$ approximates $\sigma_r$. If the amount of information obtained from one index measurement is not sufficient, retention indices measured on other columns can supply additional information. In order to calculate this information, the n-dimensional probability function $P(x_1, x_2, \ldots, x_n)$ must be known. Equation 3 has to be replaced by
$$I(1,2,\ldots,n) = -\int\!\cdots\!\int P_m(x_1, x_2, \ldots, x_n) \operatorname{ld} P_m(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n + \int\!\cdots\!\int P_e(x_1, x_2, \ldots, x_n) \operatorname{ld} P_e(x_1, x_2, \ldots, x_n)\,dx_1\,dx_2 \cdots dx_n \tag{6}$$
ly limited. A large number of indices is required to construct a n-dimensional histogram that sufficiently approximates the n-dimensional distribution function.
SEQUENCING THE STATIONARY PHASES If a retrieval procedure is to be designed, it is important to choose a set of columns that will yield the maximum amount of information. According to Equation 7, the largest amount of information is produced by the set of columns for which the determinant of the covariance matrix is maximal. For the choice of the optimal set of n columns from a total of m available, $\binom{m}{n}$ determinants have to be calculated. As the calculations involved are elaborate, this method of selection can be economically employed only when the number of columns is low. A more economical (with respect to computer time) selection procedure consists of choosing one of the columns (e.g., the column yielding maximal information) as the first column. The information obtained with this column is indicated by $I_1$. The column to be added to the first in order to obtain a maximum amount of information for the first and second column together can be selected by using a selection criterion composed of the information for the second column ($I_2$) and the correlation between the retention indices for both columns. In this study, the selection criterion $I_2\,(1 - |\rho_{21}|)$ is used. The column for which the value of this criterion is maximal is added to the first. In this formula $\rho_{21}$ is the linear correlation coefficient, and it is defined as
In case of normal distributions, Equation 6 can be converted to
$$I_G(1,2,\ldots,n) = \tfrac{1}{2} \operatorname{ld} \frac{|\mathrm{cov}|_m}{|\mathrm{cov}|_e} \tag{7}$$
where $|\mathrm{cov}|_m$ is the determinant of the covariance matrix of the retention indices and $|\mathrm{cov}|_e$ that of the errors. The covariance matrix is defined as follows
$$\mathrm{cov} = \begin{pmatrix} \sigma_{11} & \sigma_{12} & \cdots & \sigma_{1n} \\ \sigma_{21} & \sigma_{22} & \cdots & \sigma_{2n} \\ \vdots & \vdots & & \vdots \\ \sigma_{n1} & \sigma_{n2} & \cdots & \sigma_{nn} \end{pmatrix} \tag{8}$$
where $\sigma_{ii} = \sigma_i^2$ is the variance of the retention indices on column i and $\sigma_{ij}$ is the covariance of the indices for the columns i and j. If no correlation exists between the indices measured on the different columns, and thus all $\sigma_{ij}$ with $i \neq j$ equal zero, the covariance matrix reduces to a diagonal one. Then the total information equals the sum of the informations obtained from the columns separately. The amount of information obtained from a retrieval procedure can never exceed the information calculated from Equation 1 with N equal to the number of compounds contained in the library. If the amount of information is to be assessed for a very large library, Equations 2 to 7 can be used with P-values derived from a representative sample of retention indices. To this end, the range of possible indices is divided into m intervals of equal length $\Delta x$. If the probability of finding an index in the interval i is given by $p_i$, the integral Equation 3 can be replaced by
$$I_H(1) = -\sum_{i=1}^{m} p_i \operatorname{ld} p_i + \operatorname{ld} \Delta x - \operatorname{ld}\left(\sigma_e \sqrt{2\pi e}\right) \tag{9}$$
$$\rho_{21} = \frac{\sigma_{21}}{\sqrt{\sigma_1^2\,\sigma_2^2}}$$
The absolute value of $\rho_{21}$ is always equal to or smaller than 1. In a similar way, a third column is added to the two columns already selected. The correlation coefficient to be used is the average of the correlation coefficient of the indices for the third with the first column and that of the indices for the third with the second column. So the criterion is
$$I_3 \left(1 - \tfrac{1}{2}\left(|\rho_{31}| + |\rho_{32}|\right)\right)$$
and the column for which this criterion is maximal is selected as the third one. In general the kth column is selected by using the criterion
$$I_k \left(1 - \frac{1}{k-1} \sum_{i=1}^{k-1} |\rho_{ki}|\right)$$
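The selection rule can be sketched in a few lines. The code below assumes that the criterion for each candidate column is its single-column information multiplied by one minus the mean absolute correlation with the columns already chosen, as described in the text. The function name and data layout are illustrative; the numbers are the $I_G(1)$ values and correlation coefficients for columns 1, 7, 8, and 10 as read from Tables II and III, a subset chosen only to keep the example short.

```python
def greedy_sequence(info, corr, first):
    """Order columns by the criterion I_k * (1 - mean |rho_ki|).

    info:  dict mapping column number to single-column information (bits)
    corr:  dict mapping frozenset({a, b}) to the linear correlation coefficient
    first: column used to start the sequence
    """
    chosen = [first]
    remaining = [c for c in info if c != first]
    while remaining:
        def criterion(k):
            # Average absolute correlation of candidate k with the chosen columns.
            mean_rho = sum(abs(corr[frozenset((k, i))]) for i in chosen) / len(chosen)
            return info[k] * (1.0 - mean_rho)

        best = max(remaining, key=criterion)
        chosen.append(best)
        remaining.remove(best)
    return chosen


info = {1: 6.6, 7: 6.8, 8: 7.0, 10: 7.0}
corr = {
    frozenset((1, 7)): 0.71, frozenset((1, 8)): 0.56, frozenset((1, 10)): 0.32,
    frozenset((7, 8)): 0.91, frozenset((7, 10)): 0.79, frozenset((8, 10)): 0.86,
}
print(greedy_sequence(info, corr, first=1))  # [1, 10, 8, 7]
```

With all ten columns the greedy order need not be the final one; the text notes that the sequence found this way is usually not optimal and is refined afterwards.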
where $\sqrt{2\pi e}$ is a constant due to the normal distribution of the errors and with
$$\sum_{i=1}^{m} p_i = 1.$$
The index H in $I_H$ indicates that the information has been calculated from a histogram. The term $\operatorname{ld} \Delta x$ is to be regarded as a correction applied to
$$-\sum_{i=1}^{m} p_i \operatorname{ld} p_i$$
in order to approach the information that would be obtained from the true (continuous) distribution of the retention indices. In practice the number of intervals is chosen to be equal to $\sqrt{N}$, where N is the number of compounds in the sample. A minimum of 6 or 7 intervals and thus at least 40-50 compounds are required for the calculations. Theoretically also, an n-dimensional formula analogous to Equation 9 can be derived, but in practice its use is severe
Thus, without knowledge of the covariance determinant, a sequence of columns is found. This sequence usually is not the optimal one and depends on the choice of the first column. Another weighting function of I and $\rho$ might produce another sequence. The sequence of the columns obtained in the manner described above can be optimized without extensive use of computer time. This is done by composing the determinant of the covariance matrix according to the sequence of the columns. Next the determinant is converted into a determinant with the elements in the lower left triangle equal to zero by applying the first step of Gauss' elimination method (14). The determinant then obtained has the same value as the original one and equals the product of the diagonal terms. The value of the nth diagonal term of the converted determinant depends only on the values of the variances and covariances with indices smaller than n, but is independent
of the values of the variances and covariances with indices larger than n. Thus, the information obtained from the first n columns is determined by the product of the first n diagonal terms of the converted determinant (apart from the correction for the experimental errors, Equation 7). Apparently, the optimal sequence of columns should produce a determinant with the values of the diagonal terms decreasing from upper left to lower right. If this appears not to be so, the sequence of columns can be changed by rearranging the diagonal elements according to their decreasing value. In fact, this requires an interchange of columns and rows of the determinant corresponding to a change of the sequence of the chromatographic columns. The sequence now obtained is used to again fill the covariance matrix and the Gauss elimination method is applied once more. If this yields a converted determinant with the diagonal terms in decreasing order, the sequence is regarded as the optimal one. If necessary, more rearrangements are carried out.

ANALYTICAL CHEMISTRY, VOL. 47, NO. 3, MARCH 1975

Table I. List of Stationary Phases
Column No.   Stationary phase
1            Squalane
2            Apiezon L
3            SE-30
4            Diisodecyl phthalate
5            Poly(phenyl ether-6 ring)
6            Bis(ethoxyethyl) phthalate
7            Carbowax 20 M
8            Diethylene glycol succinate
9            Tricresyl phosphate
10           Diglycerol

Table II. Standard Deviations and Amounts of Information
Column No.   Standard deviation, s_m   Amount of information, I_H(1)   Amount of information, I_G(1)
1            193                       6.5                             6.6
2            195                       6.5                             6.6
3            193                       6.5                             6.6
4            193                       6.5                             6.6
5            192                       6.5                             6.6
6            200                       6.7                             6.7
7            221                       6.8                             6.8
8            256                       7.0                             7.0
9            197                       6.6                             6.7
10           245                       6.9                             7.0
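The reordering step described in the section on sequencing can be sketched as follows. This is an illustration under stated assumptions, not the authors' FORTRAN program: the errors are taken to be uncorrelated with the same variance on every column (so the error determinant is $\sigma_e^{2n}$), the function names are invented, and the 2 x 2 covariance matrix is a hypothetical example rather than data from the tables.

```python
import math


def pivots(matrix):
    """Diagonal terms after Gaussian elimination to upper-triangular form."""
    a = [row[:] for row in matrix]
    n = len(a)
    for k in range(n):
        for i in range(k + 1, n):
            factor = a[i][k] / a[k][k]
            for j in range(k, n):
                a[i][j] -= factor * a[k][j]
    return [a[k][k] for k in range(n)]


def cumulative_information(cov, sigma_e):
    """Information (bits) for the first 1..n columns, from the running pivot product."""
    total, out = 0.0, []
    for d in pivots(cov):
        total += 0.5 * math.log2(d / sigma_e ** 2)
        out.append(total)
    return out


def reorder_by_pivots(cov, order, max_rounds=20):
    """Eliminate, sort columns by decreasing pivot, refill; stop when the order is stable."""
    for _ in range(max_rounds):
        sub = [[cov[i][j] for j in order] for i in order]
        d = pivots(sub)
        new_order = [c for _, c in sorted(zip(d, order), key=lambda t: -t[0])]
        if new_order == order:
            break
        order = new_order
    return order


# Hypothetical columns: standard deviations 4 and 8, correlation 0.5, sigma_e = 2.
cov = [[16.0, 16.0], [16.0, 64.0]]
print(pivots(cov))                             # [16.0, 48.0]
print(cumulative_information(cov, 2.0)[0])     # 1.0
print(reorder_by_pivots(cov, [0, 1]))          # [1, 0]
```

Each pivot is the variance of a column that is left after accounting for the columns before it, which is why the running product of pivots gives the information for the first n columns of the sequence.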
RESULTS
Figure 1. Frequency distribution of retention indices measured on Carbowax 20 M. Sample from McReynolds (15). Horizontal axis: retention index.
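A histogram-based estimate of the single-column information can be sketched as below. It assumes the estimate is the entropy of the interval probabilities, plus $\operatorname{ld} \Delta x$, minus the entropy $\operatorname{ld}(\sigma_e\sqrt{2\pi e})$ of a Gaussian error distribution, which is how the text describes the terms. The function name is illustrative and the index values are synthetic, evenly spaced numbers, not McReynolds data.

```python
import math


def histogram_information(indices, n_intervals, sigma_e):
    """Histogram estimate of the information (bits) for one column."""
    lo, hi = min(indices), max(indices)
    dx = (hi - lo) / n_intervals
    counts = [0] * n_intervals
    for x in indices:
        # The largest index is placed in the last interval.
        k = min(int((x - lo) / dx), n_intervals - 1)
        counts[k] += 1
    n = len(indices)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts if c > 0)
    error_entropy = math.log2(sigma_e * math.sqrt(2 * math.pi * math.e))
    return entropy + math.log2(dx) - error_entropy


# Synthetic sample: 90 indices from 500 to 1390 in steps of 10 (hypothetical data).
synthetic = list(range(500, 1400, 10))
estimate = histogram_information(synthetic, n_intervals=9, sigma_e=2.0)

# For an evenly filled histogram the estimate equals ld(range / (sigma_e * sqrt(2*pi*e))).
expected = math.log2(890 / (2.0 * math.sqrt(2 * math.pi * math.e)))
print(round(estimate, 2), round(expected, 2))  # 6.75 6.75
```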
The distribution function of the gas chromatographic retention indices was estimated from a sample of indices of 248 compounds for 10 stationary phases, taken from the compilation of McReynolds (15). The stationary phases are listed in Table I. Each set of 248 indices was divided into 9 intervals. Equation 9 was used for the calculation of the information to be obtained on identification using a library with a distribution of indices equal to that of the library used. The values are listed as $I_H(1)$ in Table II. For each set of indices, the estimate of the standard deviation ($s_m$) was calculated. Assuming that the distributions are approximately normal, the values of $s_m$ can be used for a calculation of $I_G(1)$ with Equation 4. Values of $s_m$ and $I_G(1)$ are also listed in Table II. A value of 2 index units for the standard deviation $\sigma_e$ was used.
The differences between $I_H(1)$ and $I_G(1)$ are small. In Figure 1, a histogram and a normal distribution with the same standard deviation are plotted for column 7. The retention indices for the different columns are correlated, as can be seen from the values of the correlation coefficients compiled in Table III.
As has been indicated, the information obtained from n columns practically cannot be estimated from an n-dimensional histogram. Judging from the small differences between $I_H(1)$ and $I_G(1)$, a reliable value of the amount of information can be obtained by using Equation 7. The sequence of columns found by the selection procedure described is given in Table IV. The corresponding amounts of information are listed in Table V. The differences with the values in Table II are due to roundoff errors in the computer calculations. The total amount of information to be obtained with 10 columns equals 43.3 bits.
In Figure 2 the amount of information as a function of the number of columns has been plotted in two ways, without and with correlation. The increase in information when adding a column gradually decreases as the number of phases increases.
All calculations were performed on the IBM 360/65 of the Computer Centre of the University of Technology, Delft. Programs were written in FORTRAN and the IBM Scientific Subroutine Package was used.

Table III. Linear Correlation Coefficients of the 10 Sets of Retention Indices Taken from McReynolds
Column No.   1      2      3      4      5      6      7      8      9      10
1            1      ...    ...    ...    ...    ...    ...    ...    ...    ...
2            0.99   1      ...    ...    ...    ...    ...    ...    ...    ...
3            0.99   0.99   1      ...    ...    ...    ...    ...    ...    ...
4            0.97   0.98   0.99   1      ...    ...    ...    ...    ...    ...
5            0.93   0.93   0.95   0.97   1      ...    ...    ...    ...    ...
6            0.87   0.88   0.91   0.96   0.96   1      ...    ...    ...    ...
7            0.71   0.73   0.76   0.83   0.85   0.93   1      ...    ...    ...
8            0.56   0.58   0.62   0.72   0.76   0.86   0.91   1      ...    ...
9            0.90   0.91   0.93   0.97   0.96   0.99   0.91   0.84   1      ...
10           0.32   0.33   0.38   0.49   0.53   0.68   0.79   0.86   0.64   1
Table IV. Sequence of Columns Determined by the Selection Procedure
Start with column No.   Sequence of columns
1                       1  10  8  7  5  9  2  4  3  6
2                       2  10  8  7  5  9  1  4  3  6
3                       3  10  8  7  5  9  1  4  2  6
4                       4  10  8  7  5  1  9  2  6  3
5                       5  10  8  7  1  9  2  4  3  6
6                       6  10  7  8  1  5  9  2  4  3
7                       7  1  10  8  5  9  2  4  3  6
8                       8  1  10  7  5  9  2  4  3  6
9                       9  10  8  7  1  5  2  4  6  3
10                      10  1  8  7  5  9  2  4  3  6
DISCUSSION The amounts of information obtained on identification by retrieval of gas chromatographic retention indices have been calculated with an error of the index of 2 index units (standard deviation). A reliable estimate of the errors for the data taken from McReynolds is not available. It should be noted that a decrease of the standard deviation of the error by a factor of 2 increases the amount of information by 1 bit for each column used. The question arises whether the values for the amounts of information will be influenced by the sample of indices chosen for the calculations. It is difficult to check whether the sample is representative of the total population of compounds accessible for identification by gas chromatography. A definite proof of this representativeness cannot be given. Nevertheless, it was felt that the 248 compounds selected are a good cross-section of the entire population. The calculations were based upon the assumption that the distributions can be approximated by Gaussian ones. As can be seen from a comparison of the values of $I_G(1)$ and $I_H(1)$, the results of the calculations are relatively insensitive to the exact shape of the distribution function. Support for this was obtained from calculations using the same standard deviations and assuming a rectangular distribution function. A maximum difference of only 0.3 bit per column was observed. This indicates that the values given may be in error to the extent of not more than 0.5 bit. In general, the amount of information might be used as a parameter (probably together with other parameters) to select the optimal analytical procedure. This study deals only with the information obtained from a retrieval procedure for identification by gas chromatography. Studies on mass spectrometry and infrared spectrometry are in progress.
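The insensitivity to the shape of the distribution can be illustrated with a short calculation. The sketch assumes that the error is small compared with the spread of the indices, so that the measured distribution keeps its rectangular or Gaussian shape, and that the errors are Gaussian in both cases. The function names are illustrative; 193 and 2 index units are the values quoted for column 1.

```python
import math

GAUSS_CONSTANT = math.sqrt(2 * math.pi * math.e)  # about 4.13


def information_gaussian(s, sigma_e):
    """Bits for a Gaussian index distribution with standard deviation s."""
    return math.log2(s * GAUSS_CONSTANT) - math.log2(sigma_e * GAUSS_CONSTANT)


def information_rectangular(s, sigma_e):
    """Bits for a rectangular index distribution with the same standard deviation s."""
    width = s * math.sqrt(12.0)  # a rectangle of this width has standard deviation s
    return math.log2(width) - math.log2(sigma_e * GAUSS_CONSTANT)


difference = information_gaussian(193.0, 2.0) - information_rectangular(193.0, 2.0)
print(round(difference, 2))  # 0.25
```

Under these assumptions the difference does not depend on the standard deviation, and its size is in line with the maximum of 0.3 bit per column reported above.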
Figure 2. Amount of information as a function of the number of columns. Horizontal axis: number of columns.

From this study, conclusions can be drawn only for problems that are to be solved by gas chromatographic retrieval procedures, in particular with regard to the number of columns and the nature of the stationary phases to be used. The results presented in this study can only be applied to libraries with retention indices that have the same distribution function as that of the 248 compounds used. Considerable differences can be observed when an entirely different population is used. Preliminary studies indicate that for the same set of ten columns, amounts of information of 29.7, 25.1, and 28.6 bits are obtained when the sample of compounds consists of alcohols, esters, and aldehydes/ketones, respectively. In addition to this, it should be noted that the figures quoted for the information are to be interpreted as averages. An information of 10 bits means that on the average one out of $2^{10}$ compounds can be identified. Because of the uneven distributions, about 1 bit extra per column has to be produced. From the results, it is clear that the obvious combination of polar and apolar columns produces the best results. The first four columns cover one apolar and three (rather) polar stationary phases. In summarizing the results obtained so far, it appears that the concept of information as introduced by Shannon can be useful when exploring the possibilities of a retrieval system for a set of compounds. A sample of at least 40 compounds, representative of the whole set of compounds to be retrieved, has to be chosen. The standard deviations of the distributions of the retention indices on the several columns have to be calculated. For the same set of compounds, the covariances for the different pairs of columns are also
Table V. Amounts of Information (Bits), Corresponding to Sequences of Table IV
              First column
              1      2      3      4      5      6      7      8      9      10
1 column      6.8    6.8    6.7    6.7    6.6    6.6    6.8    6.9    6.6    6.9
2 columns     13.2   13.3   13.2   13.1   13.1   13.0   13.0   13.1   13.0   13.2
3 columns     18.9   18.9   18.7   18.5   18.5   18.0   18.6   18.9   18.2   18.8
4 columns     24.0   24.0   23.8   23.5   23.6   22.8   24.0   24.0   23.3   24.0
5 columns     28.3   28.3   28.1   27.8   28.4   27.5   28.4   28.4   27.9   28.6
6 columns     31.9   31.9   31.7   31.6   32.0   31.8   31.9   31.9   32.2   32.2
7 columns     35.1   35.1   34.7   35.2   35.1   35.2   35.1   35.1   35.4   35.3
8 columns     38.1   38.1   37.7   38.3   38.2   38.4   38.1   38.1   38.2   38.1
9 columns     40.7   40.7   40.7   40.9   40.7   41.0   40.7   40.7   40.8   40.7
10 columns    43.3   43.3   43.3   43.3   43.3   43.3   43.3   43.3   43.3   43.3
required. An estimate has to be made of the number of columns to be used for the retrieval system; e.g., retrieving $2^{10}$ (about 1000) compounds requires 10 bits of information. If only 1 column is needed for the retrieval, the column yielding the maximum amount of information is selected. For the design of a retrieval system with more chromatographic columns, the optimal choice can be made by choosing the combination of columns yielding a maximum value for the covariance determinant. It is advisable to use this procedure when 2 or 3 columns have to be chosen from a set of not more than 20 columns. If more than 2 or 3 columns have to be selected from a large set, it is more economical (with respect to computer time) to select the columns by choosing first the column with maximum information, adding the next columns by using the weighting factor, and optimizing the sequence by applying the procedure described in the section on sequencing the stationary phases. A final remark on the concept of information should be made. More information corresponds to a broader distribution of the retention indices, i.e., the standard deviation is larger. This indicates that, on the average, separations are better on columns producing more information. The user of the concept of information as proposed in this paper should be aware that the conclusions drawn always refer to averages. An elaborate discussion of the correlations between the retention indices will not be given here. It is only observed that the correlations correspond to the sequence of stationary phases used by McReynolds (16) and based upon the Rohrschneider concept of polarity (17). Taking for instance the correlations with respect to squalane, an increase in polarity corresponds to a decrease in correlation. Although the amount of information gained by adding a column decreases as the number of columns increases, the curve of Figure 2 does not reach a maximum.
Apparently retention indices are to be described by at least 10 (linear) factors. A factor analysis according to Malinowski (18) applied to the data used in this study is in accordance with this finding. Several authors (19-22) conclude that the retention indices can be described by not more than 8 factors. However, the sets used by these authors comprised fewer and less diverse compounds, and thus a direct comparison is impossible. The results of a factor analysis applied to gas chromatographic data by Weiner and Parcher (23) indicate that a choice of columns can be made when the factors are known. A comparison of the results presented here and those of Weiner and Parcher is difficult because of the different sets used in these studies.
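The design rule summarized in the Discussion can be turned into a rough estimate of the number of columns needed. The sketch below is one possible reading, not the authors' procedure: it treats the "about 1 bit extra per column" as a margin added to $\operatorname{ld} N$ for every column used, and the function name is invented. The cumulative information values are those listed in Table V for the sequence starting with column 1.

```python
import math


def columns_needed(n_compounds, cumulative_bits, extra_bits_per_column=1.0):
    """Smallest number of columns whose information covers ld N plus a per-column margin."""
    required = math.log2(n_compounds)
    for n, bits in enumerate(cumulative_bits, start=1):
        if bits >= required + extra_bits_per_column * n:
            return n
    return None  # the available columns do not supply enough information


# Cumulative information (bits) for 1 to 10 columns, first column 1 (Table V).
table_v_first_column_1 = [6.8, 13.2, 18.9, 24.0, 28.3, 31.9, 35.1, 38.1, 40.7, 43.3]

print(columns_needed(1024, table_v_first_column_1))  # 2

# Number of determinants for an exhaustive choice of 3 columns out of 20.
print(math.comb(20, 3))  # 1140
```

The count of 1140 determinants illustrates why the exhaustive search is recommended only for choosing 2 or 3 columns from a set of at most 20.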
ACKNOWLEDGMENT The authors are indebted to J. H. Kelderman and G. van Marlen of the Technical University of Delft, The Netherlands, for their helpful discussions.
LITERATURE CITED
(1) H. Kaiser, Anal. Chem., 42 (2), 24A (1970).
(2) K. Eckschlager, Collect. Czech. Chem. Commun., 36, 3016 (1971).
(3) K. Eckschlager, Collect. Czech. Chem. Commun., 37, 137 (1972).
(4) K. Eckschlager, Collect. Czech. Chem. Commun., 37, 1486 (1972).
(5) K. Eckschlager, Collect. Czech. Chem. Commun., 38, 1330 (1973).
(6) K. Doerffel and W. Hildebrandt, Wiss. Z. Tech. Hochsch. Chem. "Carl Schorlemmer" Leuna-Merseburg, 11 (1), 30 (1969).
(7) B. Griepink and G. Dijkstra, Fresenius' Z. Anal. Chem., 257, 269 (1971).
(8) S. L. Grotch, Anal. Chem., 42, 1214 (1970).
(9) J. F. K. Huber and H. C. Smit, Fresenius' Z. Anal. Chem., 245, 84-88 (1969).
(10) E. Palm, Fresenius' Z. Anal. Chem., 256, 25-27 (1971).
(11) D. L. Massart, J. Chromatogr., 79, 157 (1973).
(12) D. L. Massart and R. Smits, Anal. Chem., 46, 283 (1974).
(13) C. E. Shannon and W. Weaver, "The Mathematical Theory of Communication," The University of Illinois Press, Urbana, Ill., 1949.
(14) D. K. Faddeev and V. N. Faddeeva, "Computational Methods of Linear Algebra," W. H. Freeman & Co., San Francisco-London, 1963.
(15) W. O. McReynolds, "Gas Chromatographic Retention Data," Preston Technical Abstracts Company, Evanston, Ill., 1966.
(16) W. O. McReynolds, J. Chromatogr. Sci., 8, 685 (1970).
(17) L. Rohrschneider, J. Chromatogr., 22, 6 (1966).
(18) E. R. Malinowski, Separ. Sci., 1, 661 (1966).
(19) P. H. Weiner and D. G. Howery, Can. J. Chem., 50, 448 (1972).
(20) P. H. Weiner and D. G. Howery, Anal. Chem., 44, 1189 (1972).
(21) P. H. Weiner and J. F. Parcher, Anal. Chem., 45, 302 (1973).
(22) D. G. Howery, Anal. Chem., 46, 829 (1974).
(23) P. H. Weiner and J. F. Parcher, J. Chromatogr. Sci., 10, 612 (1972).
RECEIVED for review February 4, 1974. Accepted October 8, 1974.
Rapid Method for the Determination of the Composition of Natural Gas by Gas Chromatography J. S. Stufkens and H. J. Bogaard Koninklijke/Shell-Laboratorium, Amsterdam, Shell Research E. V., The Netherlands
A rapid method is described for the determination of the composition of methane-rich natural gas by gas chromatography using one single, temperature-programmed column packed with Porapak-R. The detection system comprises two detectors operated in series, viz. a thermal conductivity detector for N2, O2, CO2, and C2H6 and a flame ionization detector for C3H8 and heavier hydrocarbons. The methane content of the natural gas is calculated by subtracting the sum total of the minor components from 100%. The calorific value can be calculated from the composition with a relative standard deviation (n = 17) of 0.22%. Comparison of reference and found calorific values shows that the systematic error is negligible.
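The methane-by-difference step mentioned in the abstract amounts to a single subtraction. The sketch below uses an invented function name and a hypothetical composition in percent; it is not data from the paper.

```python
def methane_by_difference(minor_components):
    """Methane content (%) as 100 minus the sum of all measured minor components (%)."""
    total_minor = sum(minor_components.values())
    if not 0.0 <= total_minor <= 100.0:
        raise ValueError("minor components must sum to a value between 0 and 100%")
    return 100.0 - total_minor


# Hypothetical composition of a methane-rich gas, in percent.
sample = {"N2": 1.0, "CO2": 0.5, "C2H6": 3.0, "C3H8": 0.5}
print(methane_by_difference(sample))  # 95.0
```

Because methane is not measured directly, any error in the sum of the minor components carries over in full to the methane value.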
The past few decades have seen an impressive growth in production and consumption of natural gas. Methods to determine its quality have gained correspondingly in importance. The prime factor is the calorific value, which governs the sales value of the gas. It can be measured experimentally with a calorimeter, but most of the commercially available calorimeters for routine purposes have an accuracy no better than 1-1½%. A more exact calorific value is attainable by calculation from the composition (1), if the latter can be determined accurately. Several gas-chromatographic methods for the determination of the composition of natural gas have been described in the literature. Sears (2) has developed a method using