Classification of Petroleum Pollutants by Linear Discriminant Function Analysis of Infrared Spectral Patterns

James S. Mattson,* Carol S. Mattson, Mary Jo Spencer, and Frank W. Spencer

Rosenstiel School of Marine and Atmospheric Science, University of Miami, 4600 Rickenbacker Causeway, Miami, Fla. 33149

* Present address: National Oceanic and Atmospheric Administration, Environmental Data Service, Center for Experiment Design and Data Analysis, 3300 Whitehaven Street, N.W., Washington, D.C. 20235.

Digitized transmission infrared spectra of 194 oils (62 crude oils, 60 No. 2 and diesel fuels, 28 No. 6 fuels, 22 waste crankcase lubes, 12 No. 4 fuels, and 10 No. 5 fuels) were subjected to pattern classification using linear discriminant function analysis (LDFA). Several "decision tree" schemes were tested to develop a high predictive capability, ultimately resulting in a "recognition power" of 97.5%. The highest average probability of class membership comes from the waste crankcase lubricants (P = 0.898), while the lowest average probability, not unexpectedly, arises from the crude oils (P = 0.838).

The classification of multicomponent petroleum oils (crude oils, used lubricants, distillate and residual fuels) solely by their infrared absorption spectra is a difficult task. Crude oils include a wide variety of different mixtures, from heavy asphaltic crudes to light crudes that are similar to a diesel fuel. Furthermore, the distinctions between classes of fuel oils (i.e., No. 1, 2, 4, 5, and 6 fuels) are based upon ASTM specifications for continuous properties such as flash point and viscosity, rather than any chemical distinctions that might be employed a priori in infrared analysis. In South Florida, for example, local fuel oil suppliers meet requirements for No. 4 or 5 fuel oils by blending appropriate proportions of No. 2 and 6 fuels. In order to reduce the amount of sampling required in the event of an oil pollution incident, it is useful to be able to initially classify the pollution sample into one of the above groups.

Infrared spectrometry has been suggested as a useful analytical technique for oil classification and identification, since it does provide some information on the aliphatic, aromatic, polynuclear aromatic, carbonyl, and organosulfur composition of an oil (1-5). Infrared spectra have been used in previous efforts to distinguish asphalts from residual fuels (1), and to provide a tool for "fingerprinting" oils (2-4). Kawahara et al. (5) applied linear discriminant function analysis (LDFA) to their infrared data (1) to distinguish only two classes, asphalts and residual fuels. In this study, using fresh, unweathered oils (except for 22 used crankcase lubes) and a high-resolution, sophisticated computer-spectrometer system, the primary considerations included variable selection, experimental precision, and choosing between parametric and nonparametric pattern recognition techniques for the discrimination of six classes of oils.

Pattern recognition is not new (6), but its application to chemical analysis is of recent vintage (7-12). Two powerful nonparametric pattern recognition techniques, the learning machine and the K-nearest neighbor (KNN) approach, have been applied to both mass and infrared spectra. For a review of the learning machine technique, one should see Jurs and Isenhour (13, 14), while the KNN approach is described in the recent review by Kowalski (15). The classical pattern recognition technique, linear discriminant function analysis (LDFA), is based upon an assumption of multivariate normal statistics within each class, and is discussed in detail elsewhere (16-18).

Efforts to apply pattern recognition techniques to chemical data have usually relied on major differences existing between groups, i.e., using mass spectral or infrared data to distinguish between carbonyl and non-carbonyl-containing compounds. In addition, the original data have usually been derived from data files prepared for other purposes (Sadtler infrared spectra or API Project 44 mass spectra), and have been reduced to binary (peak/no-peak) or other simplified intensity formats. In their studies similar to this one, Kawahara et al. (5, 19) employed manually digitized infrared spectral patterns derived from only seven major absorption bands: 1600, 1460, 1375, 1027, 870, 810, and 720 cm⁻¹. They (5, 19) used the BMD computer program (16), which was also employed in the present study, to develop linear discriminant functions which would correctly classify 18 asphalts and 20 No. 6 fuels in one case (5), and 19 asphalts, 21 No. 6 fuels, and 6 crude oils in the other (19). To eliminate the need for pathlength information, Kawahara et al. (5, 19) transformed their seven variables into the 21 possible ratios, a valid transformation, as well as the inverses of the 21 ratios, a superfluous transformation (a short sketch of this ratio step follows the introduction). The binary distinction between the 18 asphalts and 20 residual fuels (5) was nearly perfect, with all 38 samples correctly classified with an average probability of about 0.99. The separation of asphalts, No. 6's, and crudes was less than perfect, with one asphalt classifying as a No. 6 and one No. 6 classifying as a crude. Kawahara and Yang (19) tested their classification scheme with four unknowns from actual spill cases, and the predictions agreed with classifications derived from other analytical procedures. The power of this tool (LDFA) in reducing the sampling required to identify the source of a clandestine oil spill cannot be overstated.

In this study, a larger sampling (194 oils) over more classes (six) was attempted, using a higher degree of sophistication in data acquisition than was employed by Kawahara et al. (5, 19). In addition, a modification of the method for analyzing infrared data (3) produced a Gaussian data set, a substantial improvement over earlier, highly skewed data sets (1, 2, 5, 19). The increased complexity of the problem lowered the predictive success (189/194 = 97.5% correctly classified), as well as the average probability of class membership (P = 0.878); however, continued efforts by the authors and by Clark and Jurs (20) have produced no improvement in predictive capability.
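Forming the 21 possible ratios from the seven band absorbances mentioned above (the pathlength-cancelling transformation of Kawahara et al.) is a one-line combinatorial step. The sketch below is illustrative only: the absorbance values are placeholders, and the dictionary layout is an assumption of convenience, not the programs those authors used.

```python
from itertools import combinations

# Absorbances of the seven bands (cm-1) used by Kawahara et al.; the values
# shown here are placeholders for illustration only.
bands = {1600: 0.21, 1460: 0.85, 1375: 0.52, 1027: 0.10,
         870: 0.07, 810: 0.06, 720: 0.18}

# The 21 = C(7, 2) pairwise ratios are pathlength-independent because the
# common cell-thickness factor cancels; their inverses add no new information.
ratios = {(a, b): bands[a] / bands[b] for a, b in combinations(bands, 2)}
print(len(ratios))   # 21
```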

EXPERIMENTAL



The data set consists of 194 patterns, each comprised of 21 peak heights in absorbance units, using the absorbance at 1990 cm⁻¹ as a fixed zero for all peaks in the 2000 to 650 cm⁻¹ region. The experimental parameters and peak selection criteria are thoroughly discussed in an earlier, companion paper (3), and the Perkin-Elmer 180/Data General NOVA data acquisition system has been described elsewhere (21).

Table I. Average Cumulative Probabilities of Class Membership Predicted for 194 Patterns

  Oil class                    Average cumulative probability
  22 Waste crankcase lubes     0.898
  60 No. 2 and diesel fuels    0.894
  12 No. 4 fuels               0.878
  10 No. 5 fuels               0.878
  28 No. 6 fuels               0.880
  62 Crude oils                0.838
  Overall average              0.878
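For concreteness, a 21-element pattern of the kind described in this section could be tabulated from a digitized absorbance spectrum as sketched below. The band positions are those listed in Table II; the interpolation helper and the assumed spectrum format (parallel wavenumber and absorbance arrays) are illustrative assumptions, not the authors' acquisition code.

```python
import numpy as np

# The 21 analytical band positions (cm-1) used as pattern variables (Table II).
BANDS = [1629, 1603, 1518, 1304, 1166, 1154, 1032, 963, 918, 888, 870,
         846, 832, 809, 793, 781, 765, 741, 722, 697, 673]
BASELINE_CM1 = 1990   # absorbance here is taken as the fixed zero for all peaks

def pattern_from_spectrum(wavenumbers, absorbance):
    """Return the 21 baseline-corrected peak heights for one oil spectrum.
    `wavenumbers` and `absorbance` are parallel arrays spanning 2000-650 cm-1."""
    # np.interp expects increasing x, so sort in case the scan runs high-to-low.
    order = np.argsort(wavenumbers)
    w = np.asarray(wavenumbers, dtype=float)[order]
    a = np.asarray(absorbance, dtype=float)[order]
    baseline = np.interp(BASELINE_CM1, w, a)
    return np.array([np.interp(b, w, a) - baseline for b in BANDS])
```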

RESULTS AND DISCUSSION

Pattern recognition involves the use of a series of similar observations made on a large group of objects, either with the intent to separate the objects into subgroups according to similarities between the individual objects, or with the intention of developing a set of rules by which future unknown objects can be classified into known subgroups. The former is called unsupervised learning, and can be used to identify previously unknown similarities between elements of the overall group. The principal method of unsupervised learning is cluster analysis, which is used to divide a large number of elements into two, three, or more subgroups without any a priori knowledge of either the number of subgroups expected or the property which most clearly separates the individual objects into the subgroups. Curtis and Starks (22) are developing a simple, elegant procedure for classifying and identifying oils using cluster analysis techniques with UV-fluorescence data.

In this study, the number of subgroups was known, and a 194-member training set was available in which the subgroup assignment of each element was known. The property which would separate the elements of the training set was not known, however. The goal of the pattern recognition study, then, was to examine all of the inter- and intra-group pattern relationships for the 194-oil training set, in an effort to develop a set of rules for the classification of future unknowns. This process, in which a training set consisting of objects with known properties is used to determine a predictive classifier for unknown objects, is called supervised learning.
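The two modes of learning can be made concrete with a modern library. The sketch below uses scikit-learn as a stand-in (an assumption of convenience, not the BMD program used in this work): it clusters the patterns without labels, and separately fits a supervised linear discriminant classifier to the labeled training set.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# X: 194 x 21 array of peak heights; y: the known class of each oil.
# Random placeholder data stand in for the real training set here.
X = np.random.rand(194, 21)
y = np.random.randint(0, 6, size=194)

# Unsupervised learning: divide the oils into six subgroups using only the
# patterns themselves, with no class labels supplied.
clusters = KMeans(n_clusters=6, n_init=10, random_state=0).fit_predict(X)

# Supervised learning: use the labeled training set to build a classifier
# that assigns a future unknown to one of the six known classes.
lda = LinearDiscriminantAnalysis().fit(X, y)
unknown = X[0:1]            # an "unknown" spectrum, for illustration
print(lda.predict(unknown), lda.predict_proba(unknown))
```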

Figure 1. "Decision tree" scheme used to classify oils by linear discriminant function analysis. (Nodes of the tree: UNKNOWN OIL → CRUDE or NON-CRUDE; NON-CRUDE → WASTE LUBE, 2 OR DIESEL, or 4, 5, OR 6; 4, 5, OR 6 → No. 6 FUEL or No. 4 OR 5; No. 4 OR 5 → No. 4 FUEL or No. 5 FUEL.)

One method of testing predictive classifier functions is the straightforward approach of treating each member of the training set in turn as an "unknown", and predicting its class based upon the information contained in the 193 remaining "knowns" in the training set. Kowalski (15) calls this the "leave-one-out" technique, and it is used in the recently developed nonparametric approaches to pattern recognition. For the classical parametric method, called linear discriminant function analysis (LDFA), the classifier functions are tested by their "recognition power" (7), or the ability to correctly classify all of the members of the training set.

If the frequency distributions for the data are not known, or if the distributions are known to differ significantly from a normal (Gaussian) distribution, nonparametric pattern recognition methods should be considered before attempting to employ LDFA. Nonparametric techniques make no assumptions concerning the distribution of the data and therefore should have a wider range of application. Two principal nonparametric techniques have been derived: the K-nearest neighbor method and the learning machine. In the K-nearest neighbor technique, the Euclidean distance between points in n-space is used as a measure of object similarity. An unknown is assigned to a class based on the class memberships of its K (K = 1, 3, . . .) nearest neighbors.
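A minimal sketch of this leave-one-out test with a 1-nearest-neighbor classifier (the simplest member of the KNN family, and the variant discussed later in this section) is given below; NumPy and the array names are assumptions of convenience, not the authors' code.

```python
import numpy as np

def leave_one_out_1nn(patterns, labels):
    """Classify each pattern by its single nearest neighbor (Euclidean
    distance in 21-dimensional peak-height space), using the remaining
    193 patterns as "knowns", and return the fraction classified correctly."""
    patterns = np.asarray(patterns, dtype=float)
    labels = np.asarray(labels)
    correct = 0
    for i in range(len(patterns)):
        d = np.linalg.norm(patterns - patterns[i], axis=1)
        d[i] = np.inf                      # leave the test pattern out
        predicted = labels[np.argmin(d)]   # class of the nearest neighbor
        correct += predicted == labels[i]
    return correct / len(patterns)

# patterns: 194 x 21 array of baseline-corrected absorbances;
# labels: the six oil classes. Loading them is left to the reader.
```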

Table II. Linear Discriminant Function Coefficients for Oil Classification

Variables 1-21 (band frequency, cm⁻¹): 1629, 1603, 1518, 1304, 1166, 1154, 1032, 963, 918, 888, 870, 846, 832, 809, 793, 781, 765, 741, 722, 697, 673.

Coefficient columns (each classifier function consists of a constant term plus weights for the variables entered by the stepwise selection): Step 1, Crude and Non-crude; Step 2, Waste lube, Two and diesel, and Four, five, and six; Step 3, Four and five, and Six; Step 4, Four and Five.


Kowalski (15) used this procedure to classify obsidian samples by geographic origin based upon their trace element content. Nearest neighbor computations by the authors and by Clark and Jurs (20) proved fruitless, with the 1NN technique producing the best results of any KNN method tried, but the success rates were only on the order of 60% correct. Recently, Duewer et al. (23) employed a new supervised learning technique called "statistical isolinear multicomponent analysis" on the old Gulf General Atomic trace element data on 20 crudes and 20 residual fuels (24), in an effort to produce an identification method which would be unaffected by weathering.

Linear discriminant function analysis (LDFA) assumes a multivariate normal distribution for all variables, and involves the computation of m classifier functions to separate m classes. The derivation of the m classifier functions, which are linear in the n variables for each pattern, is based upon a weighting scheme which maximizes the contribution of those variables which are most effective in distinguishing a given class from the rest of the population. The classifier function for each group l is of the form of Equation 1,

Y_i^l(x_1, x_2, \ldots, x_n) = C_0^l + \sum_{j=1}^{n} W_j^l x_{i,j}        (1)

where Y_i^l is related to the probability that the ith pattern belongs to class l, C_0^l is a constant term, and W_j^l is the weighting coefficient for the jth variable. For m classes (l = 1, 2, . . ., m), there are m sets of constant terms and weighting coefficients. The weighting coefficients W_j^l are computed by the BMD07M computer program (16), which computes values of W_j that favor those variables that show the greatest tendency to separate classes. BMD07M enters variables sequentially, beginning with the variable that has the maximum capability to separate classes, and then adding variables stepwise using the same criterion. The probability that a given pattern belongs to class l is then given by Equation 2, where Q_l is the prior probability that an unknown belongs to class l (all Q_l = 1.0 in this test).
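Equation 2 does not reproduce legibly in this copy. The posterior-probability form conventionally computed by stepwise discriminant programs such as BMD07M, and the one consistent with the definitions of Y_i^l and Q_l above, would be the exponential normalization below; this is an assumed reconstruction, not the authors' typeset equation.

P_i^l = \frac{Q_l \exp(Y_i^l)}{\sum_{k=1}^{m} Q_k \exp(Y_i^k)}        (2, assumed form)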

An element is then classified along with the group for which the highest P_i^l is obtained.

Separating all 194 oils into their six classes simultaneously proved to be an unsatisfactory classification scheme, so a large number of possible "decision trees" were postulated and tested. A "decision tree" approach implies multiple levels of decision, possibly with all levels being binary decisions, but not necessarily. The final decision tree settled upon is shown in Figure 1. This scheme calls for decisions in the following order:
1. Separate crudes from non-crudes.
2. Separate non-crudes into No. 4's, 5's, and 6's; No. 2's and diesel fuels; and waste crankcase lubes.
3. Separate fours and fives from sixes.
4. Separate No. 4's from No. 5's.
The recognition power of this scheme is limited by the success at step 1, where two crudes classify as non-crudes and three sixes classify as crudes. After the first step, the remaining separations are perfect.

Table I summarizes the average cumulative probabilities for each of the six classes. The cumulative probabilities are just the products of the recognition power for step 1 (0.968 for crudes and 0.977 for non-crudes), unity for each succeeding intermediate step, and the average probability of class membership computed by Equation 2 at the final step.

Since the linear discriminant functions computed in this study are applicable to infrared spectra from any instrument, and as there will be no pathlength dependence in the classification of a given pattern, the discriminant functions given in Table II are applicable to any infrared patterns obtained as described in Ref. 3. Only the absorbance linearity of the instrument, cleanliness of the KBr windows, absence of water, and proper baseline method are important. One has simply to tabulate the 21 peak heights for the spectrum of an unknown, and then form the Y^l functions (Equation 1) using the coefficients in Table II for each step of the decision tree. If, for example, Y^crude is greater than Y^non-crude, then the sample is classified as a crude with a probability given by Equation 2. This probability is then multiplied by the predictive probability for crudes of 0.968 to yield a cumulative probability that the sample is a crude oil. If Y^non-crude had exceeded Y^crude, one would have gone to step 2, and so on.
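To make the procedure concrete, the sketch below walks an unknown pattern through the first node of the decision tree. The coefficient values, the dictionary layout, and the exponential conversion of discriminant scores to probabilities are illustrative assumptions; only the use of Equation 1 scores, the crude/non-crude split, and the step-1 recognition powers of 0.968 and 0.977 come from the text. A real application would use the full Table II coefficients for all four steps.

```python
import math

# Illustrative stand-ins for Table II: for each class at a tree node, a
# constant term plus {variable index: weight} for the entered variables.
# These numbers are placeholders, NOT the published coefficients.
STEP1_COEFFS = {
    "crude":     (-1.0, {2: 0.5, 4: -1.2, 6: 2.0, 7: 1.1}),
    "non-crude": (-0.3, {2: 0.2, 4: -0.4, 6: 0.9, 7: 2.3}),
}
STEP1_RECOGNITION = {"crude": 0.968, "non-crude": 0.977}  # from the text

def discriminant_scores(pattern, coeffs):
    """Equation 1: Y^l = C_0^l + sum_j W_j^l x_j for each class l."""
    return {cls: c0 + sum(w * pattern[j] for j, w in weights.items())
            for cls, (c0, weights) in coeffs.items()}

def class_probabilities(scores):
    """Assumed form of Equation 2 with equal priors: exp-normalized scores."""
    top = max(scores.values())                      # guard against overflow
    e = {cls: math.exp(s - top) for cls, s in scores.items()}
    total = sum(e.values())
    return {cls: v / total for cls, v in e.items()}

def classify_step1(pattern):
    """Crude vs. non-crude decision. For a crude, the returned product is
    already the cumulative probability described in the text; for a
    non-crude, the 0.977 factor would be carried through steps 2-4 and
    multiplied by the final-step class probability instead."""
    p = class_probabilities(discriminant_scores(pattern, STEP1_COEFFS))
    winner = max(p, key=p.get)
    return winner, p[winner] * STEP1_RECOGNITION[winner]

# A fabricated 21-element pattern of baseline-corrected peak heights.
unknown = [0.02 * (i % 7) for i in range(21)]
print(classify_step1(unknown))
```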


LITERATURE CITED

(1) F. K. Kawahara, Environ. Sci. Technol., 3, 150 (1969).
(2) J. S. Mattson, Anal. Chem., 43, 1872 (1971).
(3) J. S. Mattson, C. S. Mattson, M. J. Spencer, and S. A. Starks, Anal. Chem., 49, 297 (1977).
(4) P. F. Lynch and C. W. Brown, Environ. Sci. Technol., 7, 1123 (1973).
(5) F. K. Kawahara, J. F. Santner, and E. C. Julian, Anal. Chem., 46, 266 (1974).
(6) G. S. Sebestyen, "Decision-Making Processes in Pattern Recognition", The Macmillan Co., New York, N.Y., 1962.
(7) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 690 (1969).
(8) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 695 (1969).
(9) B. R. Kowalski, P. C. Jurs, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 1945 (1969).
(10) P. C. Jurs, B. R. Kowalski, T. L. Isenhour, and C. N. Reilley, Anal. Chem., 41, 1949 (1969).
(11) B. R. Kowalski and C. F. Bender, Anal. Chem., 44, 1405 (1972).
(12) B. R. Kowalski, Anal. Chem., 47, 1152A (1975).
(13) T. L. Isenhour and P. C. Jurs, in "Computers in Chemistry and Instrumentation", Vol. 1, J. S. Mattson, H. B. Mark, Jr., and H. C. MacDonald, Jr., Ed., Marcel Dekker, Inc., New York, N.Y., 1973, pp 285-330.
(14) P. C. Jurs and T. L. Isenhour, "Chemical Applications of Pattern Recognition", Wiley, New York, N.Y., 1975.
(15) B. R. Kowalski, in "Computers in Chemical and Biochemical Research", Vol. 2, C. E. Klopfenstein and C. L. Wilkins, Ed., Academic Press, New York, N.Y., 1974, pp 1-76.
(16) W. J. Dixon, Ed., "Biomedical Computer Programs", University of California Press, Berkeley, Calif., 1974, pp 221-254.
(17) T. W. Anderson, "Introduction to Multivariate Statistical Analysis", Wiley, New York, N.Y., 1958.
(18) M. J. Spencer, "Oil Identification Using Infrared Spectrometry", M.S. Thesis, University of Miami, Rosenstiel School of Marine and Atmospheric Science, Miami, Fla., 1975.
(19) F. K. Kawahara and Y. Y. Yang, Anal. Chem., 48, 651 (1976).
(20) H. A. Clark and P. C. Jurs, "Studies of Petroleum Sample Identification Using Pattern Recognition Techniques", Abstract No. 325, Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, Cleveland, Ohio, March 1-5, 1976.
(21) J. S. Mattson, Anal. Chem., 49, 470 (1977).
(22) M. Curtis and S. A. Starks, "Use of Ultraviolet Spectroscopy in Oil-Spill Identification", Abstract No. 327, Pittsburgh Conference on Analytical Chemistry and Applied Spectroscopy, Cleveland, Ohio, March 1-5, 1976.
(23) D. L. Duewer, B. R. Kowalski, and T. F. Schatzki, Anal. Chem., 47, 1573 (1975).
(24) D. E. Bryan, V. P. Guinn, R. P. Hackleman, and H. R. Lukens, "Development of Nuclear Analytical Techniques for Oil Slick Identification (Phase I)", Gulf General Atomic Report GA-9889, Jan. 21, 1970 (USAEC Contract AT(04-3)-167, Project Agreement No. 43).

Received for review April 29, 1976. Accepted November 4, 1976. This research was supported by the U.S. Coast Guard, Contract No. DOT-CG-81-75-1364.