
Energy & Fuels 2007, 21, 3406–3409

Classification of Gasoline Grades Using Compositional Data and Expectation–Maximization Algorithm

Nikos Pasadakis,* Andreas A. Kardamakis, and Popi Sfakianaki

Mineral Resources Engineering Department, Technical University of Crete, Chania 73100, Greece

Received April 3, 2007. Revised Manuscript Received June 25, 2007

This work demonstrates the application of an expectation–maximization (EM) algorithm in classifying gasoline samples belonging to different commercial grades based on gas chromatography (GC) and gas chromatography–mass spectrometry (GC–MS) compositional data. The classification process was based on an “optimal” subset of compositional variables, identified by means of a variable reduction method that maintains the multivariate data structure. The EM algorithm was then applied to this variable subset to determine the Gaussian model parameters that best described the data. Initially, the methodology was evaluated on published GC–MS data of 88 Canadian gasoline samples, and the results of our study were compared to those previously reported in the literature. The methodology was subsequently tested on GC data from 74 Greek gasoline samples analyzed in our laboratory. The combination of variable reduction with the EM algorithm has proven to be a successful and reliable classification tool for gasoline samples belonging to different commercial grades (premium, regular, winter, and summer) in both data sets.

Introduction

Positive identification, discrimination, and classification of crude oils and their products are common requirements in many applications, such as quality control, forensics, and environmental studies. Adulteration of commercial fuels (illegal blending of different grades and mixing with solvents) is not a rare situation in the market, as sellers attempt to raise profit margins.1 Furthermore, the determination of the origin and the ability to distinguish fuel samples taken from different service stations and/or refineries are of great importance in arson investigations.2 In addition, source identification is also crucial in investigations dealing with spills of crude oil and its fractions in the environment.3 Finally, the classification of crude oils is essential in geochemical and reservoir management studies.4 Given the compositional complexity of petroleum fluids, the determination of their origin and group affiliation is complicated. In the case of commercial petroleum fluids, this problem is usually tackled by adding characteristic tracers. This approach cannot be adopted in all cases because of the significant cost of marking and detection. Consequently, the development of fast and reliable analytical techniques, capable of certifying the quality and authenticity of petroleum hydrocarbon mixtures, is undoubtedly essential. A large number of analytical works have been published that address the issues stated above. A limited selection of studies that focus on the classification and discrimination of petroleum-derived samples is briefly presented below. Gas chromatography (GC) and/or gas chromatography–mass spectrometry (GC–MS) are the most commonly employed analytical techniques. The use of multivariate chemometric methods is essential, because visual inspection for sample matching and classification is problematic as a result of the complexity of the chromatograms. Furthermore, this approach depends heavily upon the analyst's experience and skills. In an early work, Clark and Jurs5 managed to classify crude oils according to their geographical origin by employing gas chromatographic data with Bayesian discriminant analysis. In a more recent study, we have shown that gas chromatographic data from the gasoline range hydrocarbons of crude oils, when analyzed using principal components analysis (PCA), can successfully classify oil samples in accordance with their family affiliation.6 In another work,7 we showed that the sources of mixed oil spills in the subsurface of a petroleum refinery may be identified efficiently using lumped GC-derived compositional data treated with PCA. In two other studies, neural networks8 and PCA combined with analysis of variance (ANOVA)9 have been employed for classifying five types of commercial jet fuels. Pierce et al.10 demonstrated a method to classify gasoline samples obtained from different service stations using aligned GC profiles and PCA on features that were selected using ANOVA. The same approach was also employed by Watson et al.11 with GC–MS signals of similar gasoline samples. Differences existing in the hydrocarbon profile as well as specific components, such as tetra-alkyl lead and polar and

* To whom correspondence should be addressed. Telephone: +30-2821037669. E-mail: [email protected].
(1) Pereira, R. C. C.; Skrobot, V. L.; Castro, E. V. R.; Fortes, I. C. P.; Pasa, V. M. D. Energy Fuels 2006, 20, 1097–1102.
(2) Tan, B.; Hardy, J. K.; Snavely, R. E. Anal. Chim. Acta 2000, 422, 37–46.
(3) Lavine, B. K.; Ritter, J.; Moores, A. J.; Wilson, M.; Faruque, A.; Mayfield, H. T. Anal. Chem. 2000, 72, 423–431.
(4) Peters, K. E.; Fowler, M. G. Org. Geochem. 2002, 33, 5–36.
(5) Clark, H. A.; Jurs, P. C. Anal. Chem. 1979, 51, 616–623.
(6) Pasadakis, N.; Obermajer, M.; Osadetz, K. G. Org. Geochem. 2004, 35, 453–468.
(7) Pasadakis, N.; Gidarakos, E.; Kanellopoulou, G.; Spanoudakis, N. Environ. Forensics 2006, accepted for publication.
(8) Long, J. R.; Mayfield, H. T.; Kromann, M. V. P. R. Anal. Chem. 1991, 63, 1256–1261.
(9) Lavine, B. K.; Mayfield, H.; Cromann, P.; Faruque, A.; Mayfield, H. T. Anal. Chem. 1995, 67, 3846–3852.
(10) Pierce, K. M.; Hope, J. L.; Johnson, K. J.; Wright, B. W.; Synovec, R. E. J. Chromatogr., A 2005, 1096, 101–110.
(11) Watson, N. E.; van Wingerden, M. M.; Pierce, K. M.; Wright, B. W.; Synovec, R. E. J. Chromatogr., A 2006, 1129, 111–118.

10.1021/ef070165u CCC: $37.00 © 2007 American Chemical Society. Published on Web 08/31/2007.

polyaromatic constituents, were used to differentiate and classify samples.12,13 The use of sophisticated analytical techniques, such as GC–MS, in combination with additional sample pretreatment and separation of specific component groups, has proven to provide more detailed insight into the sample composition, by unraveling characteristic compounds and patterns. Ichikawa et al. proposed a methodology, based on the pattern of the total mass spectrum derived from GC–MS analysis, to discriminate between premium and regular gasolines using PCA and linear discriminant analysis (LDA). Doble et al.14 used concentration profiles of 44 individual components identified by GC–MS in Canadian gasoline samples to classify them into premium- and regular- as well as summer- and winter-quality grades. The classification algorithm consisted of PCA combined with a Mahalanobis distance measure to compute the grouping tendencies. Furthermore, in an attempt to improve the classification performance, neural networks were trained using the normalized compositional data. The current work consists of two classification studies employing different data sets. The first one includes the GC–MS compositional data previously presented by Doble et al.14 on 88 Canadian gasolines, belonging to premium and regular grades from both winter and summer production batches. The second set contains GC compositional data from 74 Greek unleaded gasoline samples of regular and premium grades. The aim of the study was to examine the applicability of an expectation–maximization (EM) algorithm for estimating model parameters (in this case, Gaussian model parameters) that can classify the gasoline samples into the existing quality categories (regular, premium, winter, and summer).
A variable reduction procedure was employed prior to the classification, aiming to determine the subset of chromatographic peaks (components) that ensures the best possible discrimination between gasoline samples of different quality.

Chemometric Methods

Chemometric methods have found wide application in the exploration and interpretation of analytical data from petroleum hydrocarbon mixtures because of their compositional complexity. Grouping tendencies may be obscured by irrelevant or redundant variables, which not only add a computational burden through increased dimensionality but also risk cluttering the useful information with noise. Therefore, it is essential to explore the data and seek “interesting” variables that may yield information about the problem at hand. To tackle this issue, variable reduction procedures can be employed. At the same time, instead of treating the classification task as a black-box procedure (as with neural networks), it is often preferable to express the existing groups in the parametric form of a statistical model of the variables' distribution. On the basis of these concepts, the current work is built on a classification framework that combines a variable reduction method and an expectation–maximization algorithm. Both of these computational procedures are briefly described below.

Variable Selection. Variable selection belongs to a family of chemometric techniques in exploratory multivariate analysis that is used for analyzing complicated data sets. There are many potential benefits of variable selection: facilitating data visualization and understanding, reducing measurement and storage requirements, shortening training and utilization times, and reducing dimensionality to improve predictor performance.15 In chemometrics, it is well-known that a large number of measured chromatographic peaks often leads to multicollinearity and redundancy, complicating the detection of characteristic data patterns.16 PCA has been widely applied in multivariate studies and data mining to explore high-dimensional data structure. PCA linearly transforms an original variable space (of size p) into a factor space of reduced dimensionality, in which latent variables that are orthogonal to each other [principal components (PCs)] represent the vast majority of the original variance. Most of the information present in the original multivariate data set can be represented by k PCs (where k < p), thus drastically reducing the dimensionality of the features (concentration profiles in our case). Although the dimensionality of the original variable space may be reduced from p to k dimensions by using PCA, all p original variables are still needed to define the k new variables.17 This disadvantage is tackled by several variable selection methods that employ PCA. The challenge in our case is to employ original data variables rather than the latent variables of observations (PCs). The goal is to reduce the number of original variables without losing a significant amount of information. One way of identifying the significant original variables is based on sorting the PC loading factors. Jolliffe proposed a method based on this idea for associating original variables with high-loading PC factors.18,19 This results in a subset selection that preserves most of the variation of the original data set.

(12) Johnson, K. J.; Rose-Pehrsson, S. L.; Morris, R. E. Pet. Sci. Technol. 2006, 24, 1175–1186.
(13) Sandercock, P. M. L.; du Pasquier, E. Forensic Sci. Int. 2003, 134, 1–10.
(14) Doble, P.; Sandercock, M.; Pasquier, E. D.; Petocz, P.; Roux, C.; Dawson, M. Forensic Sci. Int. 2003, 132, 26–39.
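A structure-preserving variant of such PCA-based selection, Krzanowski's Procrustes-criterion backward elimination,21 is the approach adopted in this work. The original computations were carried out in Matlab with in-house routines; the following Python/NumPy fragment is only an illustrative reconstruction under our own naming, not the authors' code.

```python
import numpy as np

def pc_scores(X, k):
    """Scores of the first k principal components of a mean-centered matrix."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T                      # n x k configuration

def procrustes_D(Y, Z):
    """Residual sum of squares D = trace{YY' + ZZ' - 2*Sigma} (eq 1)."""
    sigma = np.linalg.svd(Y.T @ Z, compute_uv=False)
    return float(np.sum(Y * Y) + np.sum(Z * Z) - 2.0 * sigma.sum())

def backward_select(X, k, q):
    """Backward elimination: repeatedly drop the variable whose removal
    keeps the k-dimensional PC configuration of the subset closest
    (minimal D) to that of the full data, until q variables remain."""
    Y = pc_scores(X, k)
    keep = list(range(X.shape[1]))
    while len(keep) > q:
        trials = []
        for i in keep:
            subset = [j for j in keep if j != i]
            trials.append((procrustes_D(Y, pc_scores(X[:, subset], k)), i))
        keep.remove(min(trials)[1])           # drop the least harmful variable
    return keep
```

With q chosen where the D curve flattens, backward_select returns the column indices of the retained chromatographic peaks.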
McCabe developed a method whereby variables (referred to as “principal variables”) were selected by optimizing various criteria.17 Malinowski, on the other hand, selected a key set of variables as mutually orthogonal as possible and called this approach “key set factor analysis”.20 Another criterion for variable selection is the preservation of the data structure among landmarks (points). Krzanowski implemented this kind of variable selection by using the Procrustes criterion in principal component space.21 To ensure data structure preservation in the selected subset, a direct comparison between individual landmarks of both sets is conducted in PC space: if X (n × p) is the data matrix, the essential dimensionality of the data used in any comparison is k, a value determined by cross-validation techniques or inspection of the reduced eigenvalues.22 The similarity is measured with the Procrustes criterion, the residual sum of squared differences (D) between corresponding points of the PC subset and the original PC variable set,

D = trace{YY′ + ZZ′ − 2Σ}   (1)

where Y (n × k) denotes the complete data set in its transformed PC configuration and Z (n × k) is the PC-transformed candidate subset matrix, the candidate subset being X̃ (n × q) with q < p (q is the number of selected variables). Σ is a diagonal matrix of the singular values of the product Y′Z. The quantity D measures the proximity (it can be thought of as an absolute distance in k-dimensional space that emerges after translation,

(15) Guyon, I.; Elisseeff, A. J. Machine Learn. Res. 2003, 3, 1157–1182.
(16) Guo, Q.; Wu, W.; Massart, D. L.; Boucon, C.; de Jong, S. Chemom. Intell. Lab. Syst. 2002, 61, 123–132.
(17) McCabe, G. P. Technometrics 1984, 26, 137–144.
(18) Jolliffe, I. T. Appl. Stat. 1972, 21, 160–173.
(19) Jolliffe, I. T. Appl. Stat. 1973, 22, 21–31.
(20) Malinowski, E. R. Anal. Chim. Acta 1982, 134, 129–137.
(21) Krzanowski, W. J. Appl. Stat. 1987, 36, 22–33.
(22) Wold, S. Technometrics 1978, 20, 397–405.

rotation, and scaling) between the configurations of X and X̃, thus justifying it as an appropriate measure to minimize. The “best” subset of q variables (note that k < q < p) will be the one that yields the smallest value of D. Because the goal is to find the optimum subset of q variables that best maintains the multivariate structure (minimal D) of the complete data set, a search technique must be applied to explore the space of candidate solutions. In Krzanowski's implementation, the variable subset is retrieved by a backward elimination procedure; for details, the reader is referred to the work of Krzanowski.21

Expectation–Maximization. Expectation–maximization is based on the notion of maximum likelihood and is widely used in data mining for optimal parameter search in various statistical models.23 Consider real-valued random vectors x1, ..., xq in Rd that are assumed to be samples from a mixture of densities,

f(x) = ∑_{k=1}^{K} p_k f(x; a_k)   (2)

where the p_k values are the mixing weights (0 < p_k < 1 for all k = 1, ..., K, and ∑_k p_k = 1) and the f(x; a_k) are densities from the same parametric family. In our case, every distribution f(x; a_k) denotes the d-dimensional normal density with unknown mean µ_k and covariance matrix Σ_k,

f(x; a_k) = (2π)^{−d/2} |Σ_k|^{−1/2} exp[−(1/2)(x − µ_k)^T Σ_k^{−1} (x − µ_k)]   (3)

where a_k = (µ_k, Σ_k). In the mixture maximum likelihood approach, the parameters p_k and a_k are chosen to maximize the log likelihood

L = log{ ∏_{i=1}^{n} ∑_{k=1}^{K} p_k f(x_i; a_k) }   (4)

using expectation–maximization. Parameter estimation of a model is performed iteratively, starting from an initial guess. Every iteration consists of an expectation (E) step, which finds the distribution of the unobserved variables given the known values of the observed variables and the current estimate of the parameters, and a maximization (M) step, which re-estimates the parameters to maximize the likelihood, using the distribution information determined in the E step.24 In this work, data clustering is realized by using a simple Gaussian distribution to model each class of data (i.e., K = 1, implying that p_k = 1). This reduces the parameter estimation problem to a much simpler procedure. The greedy version of the EM algorithm by Vlassis and Likas25 was employed in our analysis to determine the parameters of the general multivariate Gaussian model. It has been shown to deal successfully with fundamental difficulties, such as parameter initialization, the determination of the optimal number of mixing components, and the retrieval of a global solution.

Samples and Experimental Procedures

Compositional data derived from two sample sets were considered in this study. The first one was presented by Doble et al.14 based on a Canadian Petroleum Institute report. This data set contained GC–MS compositional information from gasoline samples belonging to four quality grades (namely, regular and premium unleaded gasolines, produced under winter

(23) Verbeek, J. J.; Vlassis, N.; Krose, B. Neural Comput. 2003, 15, 469–485.
(24) Neal, R. M.; Hinton, G. E. In Learning in Graphical Models; Jordan, M. I., Ed.; MIT Press: Cambridge, MA, 1999; pp 355–368.
(25) Vlassis, N.; Likas, A. Neural Process. Lett. 2002, 15, 77–87.

and summer specifications). Each group contained 22 samples, and their composition was expressed as the normalized concentration of 44 individual components, resulting in an 88 × 44 data matrix. The second data set contained GC compositional data from gasoline samples of unleaded premium and regular grades, with research octane numbers (RON) of 100 and 95, respectively. Summer- and winter-quality samples were collected between May and November 2006 from the product tanks of a Greek refinery. A total of 74 samples were analyzed in our laboratory, including 50 samples of regular and 24 samples of premium unleaded gasoline. The samples were stored in 40 mL amber glass vials at 4 °C prior to the analysis. They were analyzed using a Perkin-Elmer 8700 GC equipped with a Supelco SPB-Octyl capillary column (60 m × 0.25 mm × 1.0 µm). The GC oven temperature was held at 35 °C for 30 min, followed by a ramp of 2 °C/min up to 280 °C. The injector (1:150 split/splitless) was set at 250 °C, and the FID detector was set at 300 °C. Millennium32 software from Waters was used for data collection and peak treatment. A total of 54 components were identified on the basis of their retention time using a reformate analytical standard from Supelco. An additional 34 components were identified using published Kovats retention indexes.26 The relative concentration of each constituent was calculated as its peak area normalized to the total area of the 88 identified hydrocarbons. The experimental data matrix (74 × 82) is available from our laboratory upon request.

Results and Discussion

Case 1. This case study aimed to evaluate the classification performance of the proposed chemometric methodology and compare its results with those obtained by Doble et al.14 on the same data set. All computation was carried out using Matlab software and in-house written routines.
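The preprocessing just described, normalizing each sample's peak areas to the total identified area and mean-centering each variable before reduction, can be sketched as follows. This is a hypothetical Python/NumPy illustration, not the laboratory's actual processing code.

```python
import numpy as np

def relative_concentrations(peak_areas):
    """Normalize each sample's peak areas to its total identified area,
    giving relative concentrations that sum to 100 per sample."""
    A = np.asarray(peak_areas, dtype=float)
    return 100.0 * A / A.sum(axis=1, keepdims=True)

def mean_center(X):
    """Center each variable (column) on its mean, as done before PCA."""
    X = np.asarray(X, dtype=float)
    return X - X.mean(axis=0)
```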
The goal of the classification was to discriminate samples simultaneously into the four previously described quality grades. Using the variable reduction procedure described above, the initial number of variables (in this case, 44) was reduced to 10. The data set was mean-centered along each variable. The input parameter k was set to the number of principal components required to account for 90% of the variance in the initial data matrix; here, k = 5. A total of 10 variables were selected on the basis of the D measure (eq 1). This quantity is plotted in Figure 1 as a function of the number of variables excluded from the complete data set. The number q was determined by visual examination of the exponential-type curve, at the point where D asymptotically approaches the x axis. These variables (Table 1) were considered the most informative because they effectively preserve the data structure in the first five principal components. The resulting subset data set (88 × 10) was then fed to the EM algorithm (in standardized form) to determine the parameters of each class model, knowing a priori the number of classes (N = 4 in this case). The algorithm presented by Vlassis and Likas25 was used for the calculation of the statistical model parameters (µ and Σ). When the entire population of samples was included in the classification scheme, the algorithm was able to “learn” the group affiliation of each sample without any misclassification. To check the predictive ability of the Gaussian

(26) American Society for Testing and Materials (ASTM) D5134-98. Standard Test Method for Detailed Analysis of Petroleum Naphthas through n-Nonane by Capillary Gas Chromatography; 2003.

Figure 1. D measure displayed as a function of the number of variables for the first gasoline sample set. Note that the number on the abscissa (39) is the total number of the remaining variables after the selection of k.

Table 1. List of the Selected Variables for Both Analyzed Gasoline Sample Sets

number  case 1                  case 2
1       2,2,4-trimethylpentane  toluene
2       pentane                 3-ethylpentane
3       isopentane              methylcyclopentane
4       toluene                 2-methylpentane
5       butane                  iso-pentane
6       1,2,4-trimethylbenzene  iso-butane
7       m-xylene                m-xylene
8       2-methylhexane          n-butane
9       o-xylene                cyclopentane
10      2,3,4-trimethylpentane  3-methylpentane

models on “unseen” samples, 20% of the population of each group (i.e., 5 samples from each class) was randomly selected as a test set. The models were trained on the remaining 80% of the samples and subsequently applied to the test set; i.e., the parameters µ and Σ obtained from the training scheme were used to compute the maximum likelihood for the rest of the data. One misclassification occurred during this trial. This procedure was repeated several times using different training and test subsets to ensure the robustness of the classification procedure. All of the trials resulted in equally successful classifications, with a hit rate that did not fall below 95%. This rate is approximately 30% higher than that reported by Doble et al.14 for the demanding case of classifying all four groups simultaneously. Figure 2 shows a scatter plot of two of the selected variables (concentration, wt %, of toluene and isopentane) in standardized form. The ellipses drawn on top of the data are the two-dimensional projections of the multivariate Gaussian models (10-dimensional in this case) onto the two-dimensional variable space. These two variables were selected at random to illustrate at a glance the difficulty of the classification task. A crude separation can be perceived between the four groups, without any clear boundaries in this two-dimensional view. To discriminate successfully between the classes, additional variables must supply further information, adding to the dimensionality of the model. This is why the variables need to be carefully selected: they may provide the characteristic features that allow the Gaussian model to confine the appropriate cluster.
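Because each class is modeled by a single Gaussian (K = 1), training reduces to the closed-form maximum-likelihood estimates of µ and Σ, and prediction assigns a sample to the class whose model gives the highest log density (eq 3). The sketch below is an illustrative Python version under these assumptions; the study itself used the greedy EM implementation of Vlassis and Likas25 in Matlab, and the class labels here are hypothetical.

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood Gaussian for one class: sample mean and covariance.
    With a single component per class (K = 1), EM reduces to this closed form."""
    mu = X.mean(axis=0)
    Xc = X - mu
    return mu, Xc.T @ Xc / X.shape[0]

def log_density(x, mu, Sigma):
    """Log of the d-dimensional normal density of eq 3."""
    d = mu.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(Sigma)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(Sigma, diff))

def classify(x, models):
    """Assign x to the class whose Gaussian model gives the highest density."""
    return max(models, key=lambda grade: log_density(x, *models[grade]))
```

Given per-grade training matrices, models would be a dictionary mapping each grade to its fitted (µ, Σ) pair, and a held-out sample is labeled by classify.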

Figure 2. Scatter plot of two selected variables (standardized concentration of toluene and isopentane) for case 1. Tick marks: ×, regular winter; ○, regular summer; □, premium winter; ∗, premium summer.

Case 2. In this case, the classification task was to discriminate between regular and premium unleaded gasoline samples. The same classification methodology was employed as in the previous case. The variable reduction procedure was applied to the data matrix (74 × 82), and a total of 10 variables (Table 1) were retained, with k = 5 explaining 95% of the total initial variance in PC space. The number of variables was determined in a similar fashion to the previous case. When the entire population of samples was used, the algorithm was able to “learn” the group affiliation of each sample without any misclassification. Subsequently, the Gaussian models were trained on a population of 15 samples from each sample group. All trials yielded perfect classification (i.e., a 100% hit rate), regardless of the size of the training set (as long as the number of samples was equal to or greater than the number of variables in the subset).

Conclusions

This classification framework, which combines variable reduction with expectation–maximization, has proven to be a successful tool for classifying gasoline samples of different commercial grades. We demonstrated that the use of multivariate EM–Gaussian models can overcome limitations in classificatory power compared to other methods, such as PCA or neural networks, as shown by comparing the performance of the present method with previously reported results. The selection of a limited set of significant variables, which ensures correct classification within the sample population, may significantly reduce the required experimental measurements, especially when sample properties have to be measured by different independent procedures. Finally, because the proposed method uses compositional data derived from GC, a widely available and cost-efficient analytical technique, it can easily be adopted as a routine procedure in applications dealing with petroleum fluid classification.

EF070165U