Article pubs.acs.org/EF
Pattern Recognition Technology Application in Intelligent Processing of Heavy Oil
Yi Zhao, Chunming Xu,* Suoqi Zhao, and Quan Shi
State Key Laboratory of Heavy Oil Processing, China University of Petroleum, Beijing 102249, People’s Republic of China
ABSTRACT: Reliable and efficient product yield estimation for unknown oils after the fluid catalytic cracking (FCC) reaction is one of the key components of heavy oil intelligent processing. This paper describes the use of two chemometric pattern recognition methods, k-nearest neighbor (k-NN) classification and supervised self-organizing maps (SSOMs), to build classification models that determine the oil sample in a given data set most similar to an unknown sample and then use the FCC yield record of the corresponding oil as the product yield prediction for the unknown sample under the same reaction conditions. Two-sided t test, correlation analysis, and hierarchical cluster heat map analysis were performed to assess the quality of the models. The work provides laboratory evidence that both the k-NN and SSOMs techniques can be employed for FCC product yield estimation, while the k-NN model would be more suitable for industrial application in terms of stability and efficiency.
1. INTRODUCTION
Over the years, heavy oil has made up a fairly large portion of China’s energy consumption, because crude oil is becoming heavier and the yield of light distillates from crude oil cannot meet the growing demand for light oils driven by the rapid development of the transportation and petrochemical industries. Heavy oil upgrading has become a mainstream procedure in China’s petroleum industry. Fluid catalytic cracking (FCC) refineries have gathered abundant heavy oil processing data from their daily operations. It would be highly beneficial if the refineries could make active use of these massive holdings of past production data as a reference to predict product yields and to determine FCC reactor operating parameters for future unknown heavy oils. The traditional approach for this purpose is reaction kinetics analysis. However, the necessary chemical kinetics calculations are very time-consuming and quite challenging for current computer capabilities. There is a need for estimation models that are efficient, affordable, and accurate enough to serve as part of a heavy oil intelligent processing system. The chemometric1−5 pattern recognition technique offers a way to solve this problem by revealing the internal relations of heavy oils.6−10 Pattern recognition technology11 has been successfully applied to the crude oil industry in fields such as spectral information definition,12,13 structure−activity relationship analysis,14−16 classification,17,18 crude oil geographical origin determination,19,20 crude oil quality control,21−24 etc. In this paper, two estimation models were trained on the 11 key properties of 33 heavy oil items using the k-nearest neighbor (k-NN) and supervised self-organizing maps (SSOMs) approaches. For each of five test oil samples, the models determine which training oil item is most similar in measured properties, and the FCC product yields of that item are used as the product yield prediction for the test sample under the same reaction conditions.
The prediction accuracy of both models was evaluated and confirmed by mathematical means, such as two-sided t test, correlation, and hierarchical cluster heat map, while the k-NN model was found to be more efficient and stable. The research in this paper lays the groundwork for future applications of the two pattern recognition techniques, k-NN and SSOMs, in heavy oil intelligent processing. To accommodate random unknown heavy oils in industrial applications, the models will need to be trained with a vast data set (the entire past and real-time updated production data from the refineries would be ideal) and to include properties beyond the 11 key properties discussed in this work.

Table 1. Significant Feedstock Properties Related to Each FCC Product Yield of Heavy Oil
FCC yield    significant feedstock properties
gas          density, CR, MW, N, saturates, and resins
gasoline     density, CR, MW, S, H/C, Ni, V, and resins
diesel       CR, N, S, H/C, Ni, and resins
HCO          density, S, H/C, and V
coke         S, V, and resins

© 2012 American Chemical Society
2. EXPERIMENTAL SECTION
2.1. Data Preparation. Three typical Chinese heavy oils (from the Dagang, Shengli, and Liaohe refineries) were studied in this work. To make the training set cover a wider range of the measurement space, the three heavy oils were separated by supercritical fluid extraction and fractionation (SFEF).27,28 In total, 38 items were analyzed in this work, including the three heavy oils as well as their wide and narrow SFEF fractions. To simplify the research, only key properties that have a significant influence on product yields were measured and used. These properties were selected by applying stepwise analysis to all of the heavy oil records in the State Key Laboratory of Heavy Oil Processing using the R (programming language, http://www.r-project.org/) function stepwise. The result suggested that the variance of 11 feedstock properties [density, carbon residue (CR), average molecular weight (MW), element contents (N and S), hydrogen/carbon ratio (H/C), metal content of Ni and V, saturates, aromatics, and resins] is significant to the differences in the FCC product yield distribution.
Received: June 6, 2012 Revised: October 17, 2012 Published: October 18, 2012
dx.doi.org/10.1021/ef300968k | Energy Fuels 2012, 26, 7251−7256
Energy & Fuels
Table 2. Eleven Key Properties of Dagang Heavy Oil and Four of Its SFEF Fractions
key properties              Dagang    SFEF1     SFEF2     SFEF3     SFEF4
density at 20 °C (g/cm³)    0.9790    0.9291    0.9328    0.9359    0.9386
CR (wt %)                   16.3      2.2       2.7       3.2       3.9
MW                          1083      702       753       806       858
H/C                         1.60      1.84      1.85      1.84      1.82
N (wt %)                    0.70      0.22      0.14      0.13      0.15
S (wt %)                    0.24      0.17      0.18      0.20      0.24
Ni (μg/g)                   66.88     8.45      10.41     11.58     13.31
V (μg/g)                    1.00      0.15      0.16      0.17      0.18
saturates (wt %)            27.8      57.9      57.7      51.5      49.2
aromatics (wt %)            28.7      29.3      30.9      34.6      35.5
resins (wt %)               43.4      12.9      12.0      13.9      15.3

Table 4. Property Range of Oil Samples in the Training and Test Sets
                            training set          test set
key properties              minimum   maximum     minimum   maximum
density at 20 °C (g/cm³)    0.9071    1.0045      0.9121    0.9543
CR (wt %)                   1.4       19.9        1.9       5.8
MW                          409       1083        445       804
H/C                         1.46      1.86        1.62      1.86
N (wt %)                    0.13      0.98        0.30      0.87
S (wt %)                    0.09      3.99        0.14      2.70
Ni (μg/g)                   0.51      91.00       0.59      17.71
V (μg/g)                    0.08      60.60       0.11      2.82
saturates (wt %)            16.1      67.4        35.1      63.4
aromatics (wt %)            23.9      57.3        27.4      53.1
resins (wt %)               8.7       51.1        11.0      21.3
For the FCC product yield distribution, Table 1 summarizes each FCC product yield and the feedstock properties significantly related to it. Product yield data were also collected for the 38 oil items after the FCC reaction under the same conditions. Tables 2 and 3 display the data collected for Dagang residue and four of its narrow SFEF fractions, including the values of the 11 key properties, the FCC product yields, and the reaction parameters used. The full data set is listed in Table S-1 of the Supporting Information. To create optimum models, the property data needed to be standardized prior to modeling to eliminate the weight effect. Data [saved in comma separated value (CSV) file format] were read into R using the function read.csv and then were mean-centered and scaled with the function scale. The centered and scaled property data were then divided into the training set (33 items with individual class tags numbered from 1 to 33) and the test set (five items numbered T1, T2, T3, T4, and T5). Table 4 presents the value range of the 11 key properties for the training and test sets. To prepare for k-NN modeling, it is also necessary to apply principal component analysis29 (PCA) to the training set and to use the optimum principal components (PCs) to train the model. As a rule of thumb, the selected PCs should preserve at least 80% and preferably 90% of the total variance. The PCA was performed with the R functions prcomp, summary, and predict. The result suggested that it is optimum to use the first four PCs to build the k-NN model because the cumulative explained variance at this point reached 90.88%. Cumulative contribution values of the first eight PCs are shown in Table 5. To verify this selection, cross-validation analysis was applied to the training set as an alternative method to estimate the optimal number of PCs using the function pcaCV in the R package chemometrics.
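The mean-centering and scaling step can be illustrated with a short Python sketch. The paper itself used the R function scale; the helper name autoscale and the restriction to three of the Table 2 properties are assumptions made here for illustration only. The sample standard deviation (n − 1 denominator) is used, which is also the default behavior of scale in R.

```python
from statistics import mean, stdev

def autoscale(rows):
    """Mean-center every column and divide it by its sample standard deviation."""
    columns = list(zip(*rows))
    centers = [mean(col) for col in columns]
    scales = [stdev(col) for col in columns]  # n - 1 denominator
    scaled = [
        [(value - c) / s for value, c, s in zip(row, centers, scales)]
        for row in rows
    ]
    return scaled, centers, scales

# Density (g/cm3), CR (wt %), and MW of Dagang and SFEF1-SFEF4, from Table 2.
names = ["Dagang", "SFEF1", "SFEF2", "SFEF3", "SFEF4"]
properties = [
    [0.9790, 16.3, 1083],
    [0.9291, 2.2, 702],
    [0.9328, 2.7, 753],
    [0.9359, 3.2, 806],
    [0.9386, 3.9, 858],
]

scaled, centers, scales = autoscale(properties)
for name, row in zip(names, scaled):
    print(name, [round(v, 3) for v in row])
```

After scaling, every column has zero mean and unit sample standard deviation, so properties measured on very different scales (for example, density and MW) contribute comparably to Euclidean distances.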
The result is shown in Figure 1, which indicates that the first four PCs represent nearly 83% of the explained variance and are indeed optimal for the k-NN modeling in this research.
Figure 1. PCs and explained variance in cross-validation.
2.2. Chemometric Pattern Recognition Modeling and Estimation. Two chemometric pattern recognition techniques, k-NN and SSOMs, were used to build the estimation models in this paper. Using statistical classification approaches, the two models identify the oil sample in the training set whose properties are most similar to those of the test sample. The FCC product yields of the best matched oil then serve as the yield prediction for the test sample. Below are the detailed modeling and estimation procedures.
2.2.1. k-NN Approach. k-NN is a nonparametric classification method based on Euclidean distances between samples. The number of neighbors, k, has to be chosen. If k = 1, only the closest neighbor is considered and any new sample is assigned the class tag of its closest neighbor in the training set.25 The model was trained on the first four PCs of the 11 key properties, with the following line of R code:
R: HOD.knn <- knn(HODTrain, HODTest, ClassTrain, k = 1)
where HODTrain stands for the four PCs and ClassTrain represents the class tags in the training set. When the value of k is set to 1, each test sample is assigned the class tag of its single nearest neighbor in the training set.
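The nearest-neighbor matching and yield lookup can be sketched in Python as follows. This is a toy illustration under stated assumptions, not the authors' model: it uses only three autoscaled properties from Table 2 instead of four PC scores, and it treats SFEF4 as the "unknown" oil with the other four Table 2 oils as the training set, which is not the 33/5 split used in the paper. The function names are hypothetical.

```python
from math import dist
from statistics import mean, stdev

# Toy split for illustration only: four Table 2 oils as training set, SFEF4 as unknown.
# Features: density (g/cm3), CR (wt %), MW.
train_features = {
    "Dagang": [0.9790, 16.3, 1083],
    "SFEF1": [0.9291, 2.2, 702],
    "SFEF2": [0.9328, 2.7, 753],
    "SFEF3": [0.9359, 3.2, 806],
}
# FCC yields (wt %) from Table 3: gas, gasoline, diesel, HCO, coke.
train_yields = {
    "Dagang": [18.31, 44.31, 13.68, 11.17, 12.53],
    "SFEF1": [16.38, 53.17, 14.51, 10.69, 5.25],
    "SFEF2": [16.82, 51.35, 14.13, 11.40, 6.30],
    "SFEF3": [17.22, 50.19, 15.07, 10.09, 7.43],
}
unknown = [0.9386, 3.9, 858]  # SFEF4

# Autoscale with the training-set mean and sample standard deviation.
columns = list(zip(*train_features.values()))
centers = [mean(c) for c in columns]
scales = [stdev(c) for c in columns]

def scale(row):
    """Apply the training-set centering and scaling to one sample."""
    return [(v - c) / s for v, c, s in zip(row, centers, scales)]

def nearest_neighbor(query, training):
    """Return the tag of the training sample closest to the query (k = 1)."""
    q = scale(query)
    return min(training, key=lambda tag: dist(q, scale(training[tag])))

match = nearest_neighbor(unknown, train_features)
predicted_yields = train_yields[match]
print(match, predicted_yields)
```

In this toy split the closest training oil is SFEF3, so its Table 3 yields are returned as the estimate for the unknown sample.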
Table 3. FCC Product Yield Distribution of Dagang Heavy Oil and Its Four Fractions after SFEF under the Same Reaction Conditions
             reaction conditions                                             FCC product yield distribution (wt %)
oil items    catalyst    temperature (°C)    weight ratio of catalyst/oil    gas      gasoline    diesel    HCO      coke
Dagang       RHZ-200     500                 5                               18.31    44.31       13.68     11.17    12.53
SFEF1        RHZ-200     500                 5                               16.38    53.17       14.51     10.69    5.25
SFEF2        RHZ-200     500                 5                               16.82    51.35       14.13     11.40    6.30
SFEF3        RHZ-200     500                 5                               17.22    50.19       15.07     10.09    7.43
SFEF4        RHZ-200     500                 5                               17.18    48.08       15.26     11.31    8.17
Table 5. Cumulative Explained Variance of the First Eight PCs of the Training Set
PC                               1         2         3         4         5         6         7         8
cumulative explained variance    0.5224    0.7562    0.8456    0.9088    0.9618    0.9744    0.9856    0.9945
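The rule of thumb for choosing the number of PCs (at least 80% and preferably 90% of the total variance) can be checked against the Table 5 values with a few lines of Python. The function name components_needed is a hypothetical helper introduced for this sketch; the numbers are copied from Table 5.

```python
# Cumulative explained variance of PC1..PC8, copied from Table 5.
cumulative = [0.5224, 0.7562, 0.8456, 0.9088, 0.9618, 0.9744, 0.9856, 0.9945]

def components_needed(cumulative_variance, threshold):
    """Return the smallest number of leading PCs whose cumulative variance reaches the threshold."""
    for count, value in enumerate(cumulative_variance, start=1):
        if value >= threshold:
            return count
    raise ValueError("threshold is not reached by the listed components")

print(components_needed(cumulative, 0.80))  # 3
print(components_needed(cumulative, 0.90))  # 4
```

Three PCs satisfy the 80% criterion and four satisfy the 90% criterion, which is consistent with the choice of the first four PCs for the k-NN model.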
With k = 1, the nearest neighbor of a test sample is the training sample with the shortest Euclidean distance to it; the test sample receives that neighbor's class tag, and the neighbor's product yields are used as the product yield prediction for the test sample.
2.2.2. SSOMs Approach. Self-organizing maps (SOMs) is a mapping procedure that maps high-dimensional samples to a one-dimensional (1D) or two-dimensional (2D) discrete lattice of neuron units, which preserves the topological relationships between samples.26 When the class information is included in the training, the SOMs process becomes supervised and can be applied for classification purposes. When a new sample is presented to the trained map, the training sample that has the shortest Euclidean distance to the input sample is selected, and the class tag of the chosen training sample serves as the class for the new sample. The model was built with the SSOMs modeling R function xyf using the R code below:
R: HOD.xyf <- xyf(HODTrain, classvec2classmat(ClassTrain), grid = somgrid(11, 3, "hexagonal"))
Function xyf is available in the kohonen package of R. HODTrain is the matrix of the training set, with each row representing an oil sample, while ClassTrain holds the class tag of each training sample. The number of neurons (map units) is set to 33 (= 11 × 3) in a 2D hexagonal lattice, equal to the number of training samples.

Table 6. Classes of the Five Test Samples Determined by k-NN and SSOMs Models
test sample, best matched training oil    classification model
T1, 28                                    k-NN/SSOMs
T2, 33                                    k-NN/SSOMs
T3, 23                                    k-NN/SSOMs
T4, 13                                    k-NN/SSOMs
T5, 20                                    k-NN
T5, 21                                    SSOMs
Figure 2. PCA for predicted results of k-NN: (left) 2D and (right) three-dimensional (3D) views.
Figure 3. PCA for predicted results of SSOMs: (left) 2D and (right) 3D views.
2.3. Software. All chemometric calculations and modeling were performed with R 2.13.0 (http://www.r-project.org/). Packages used include gplots, rgl, kohonen, and chemometrics.

Table 7. Results of Two-Sided t Test and Correlation Analysis for k-NN and SSOMs Models
properties: 11 key properties (test sample and its best matched training oil); yields: FCC product yield distribution (estimated and actual FCC product yields of the test sample)
test sample, best matched training oil    classification model    p, properties    R², properties    p, yields    R², yields
T1, 28                                    k-NN/SSOMs              0.9803           0.9998            1            0.9982
T2, 33                                    k-NN/SSOMs              0.9504           0.9997            1            0.9953
T3, 23                                    k-NN/SSOMs              0.9558           0.9999            0.9996       0.9287
T4, 13                                    k-NN/SSOMs              0.9802           0.9999            0.9481       0.9962
T5, 20                                    k-NN                    0.9829           0.9998            0.9997       0.9961
T5, 21                                    SSOMs                   0.9942           0.9999            1            0.9986
3. RESULTS AND DISCUSSION
The classification results obtained from the two models are summarized in Table 6. Because each training sample has a unique class tag, the class number can be used to refer to the corresponding training oil. The classification results are also illustrated in the PCA view, as shown in Figures 2 and 3.
3.1. Result Validation. The accuracy of both classification and product yield estimation of the two models was assessed by two-sided t test, correlation analysis, and hierarchical cluster analysis with a heat map. The results show that the training oil sample whose class tag was assigned to the test sample is indeed the most similar match to the test sample in both properties and FCC product yields among the 33 training oil items.
3.1.1. Two-Sided t Test and Correlation Analysis. Two-sided t test (at the significance level α = 0.05) and correlation analysis were carried out using the R functions t.test and cor, respectively, to evaluate the difference in the 11 key properties between the test samples and their best matched training samples and to assess the difference between the estimated and actual FCC product yields of the test samples. The resulting t test p values and coefficients R² are shown in Table 7.
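To make the two statistics concrete, the following Python sketch computes a two-sample t statistic and a squared correlation coefficient for a pair of yield vectors from Table 3. It is an illustration under assumptions: the SFEF3/SFEF4 pairing is chosen here only because those data are printed in the paper, the helper names welch_t and pearson_r are hypothetical, and the unequal-variance (Welch) form is used because that is the default of t.test in R. The p value is not computed, since it additionally requires the t distribution, which the Python standard library does not provide.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Two-sample t statistic with unequal variances (Welch)."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error

def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Gas, gasoline, diesel, HCO, and coke yields (wt %) from Table 3.
sfef3 = [17.22, 50.19, 15.07, 10.09, 7.43]
sfef4 = [17.18, 48.08, 15.26, 11.31, 8.17]

t = welch_t(sfef3, sfef4)
r2 = pearson_r(sfef3, sfef4) ** 2
print(f"t = {t:.4f}, R2 = {r2:.4f}")
```

Because both yield vectors sum to 100 wt %, their means coincide and the t statistic is essentially zero in this example; the R² value is what reflects how closely the two yield distributions track each other.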
In the SSOMs approach (section 2.2.2), the test oil samples were then classified by the classes defined in the training set (class tags 1−33) using the R function predict with the following line of code, where HODTest represents the test data set:
R: HOD.Predict <- predict(HOD.xyf, newdata = HODTest)
Each test oil sample is assigned the class tag of the training sample that is closest to it in Euclidean distance. The product yields of the selected training sample then serve as the product estimation for the test sample. However, because the SSOMs model is inherently less stable, although accurate on average, the algorithm was run 1000 times in this work to obtain the optimum (most frequently repeated) classification.
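Selecting the most frequently repeated classification from many runs amounts to a majority vote, which can be sketched in Python as below. The run outcomes in this example are invented purely for illustration; the paper does not report how often each class tag was obtained, and the function name most_frequent_class is a hypothetical helper.

```python
from collections import Counter

def most_frequent_class(run_results):
    """Return the class tag predicted most often and its share of all runs."""
    counts = Counter(run_results)
    tag, votes = counts.most_common(1)[0]
    return tag, votes / len(run_results)

# Hypothetical outcome of 1000 repeated runs for one test sample.
# These counts are made up for illustration and are not taken from the paper.
runs = [21] * 640 + [20] * 310 + [13] * 50

tag, share = most_frequent_class(runs)
print(tag, share)
```

Reporting the share alongside the winning tag gives a rough indication of how stable the classification is across runs.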
Figure 4. Heat map of heavy oil properties.
Figure 5. Heat map of heavy oil catalytic cracking reaction products.
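The row dendrograms of heat maps such as Figures 4 and 5 come from agglomerative hierarchical clustering. The Python sketch below clusters the five Table 3 yield vectors with Euclidean distances and complete linkage, which are, to the best of our knowledge, the defaults behind heatmap.2 (dist and hclust in R). It is a naive illustration on five oils, not a reproduction of the 38-sample analysis, and the function name complete_linkage is hypothetical.

```python
from itertools import combinations
from math import dist

# FCC product yields (wt %) from Table 3: gas, gasoline, diesel, HCO, coke.
yields = {
    "Dagang": [18.31, 44.31, 13.68, 11.17, 12.53],
    "SFEF1": [16.38, 53.17, 14.51, 10.69, 5.25],
    "SFEF2": [16.82, 51.35, 14.13, 11.40, 6.30],
    "SFEF3": [17.22, 50.19, 15.07, 10.09, 7.43],
    "SFEF4": [17.18, 48.08, 15.26, 11.31, 8.17],
}

def complete_linkage(points):
    """Agglomerative clustering; returns the merges in the order they happen."""
    clusters = [(name,) for name in points]
    merges = []
    while len(clusters) > 1:
        def separation(pair):
            a, b = pair
            # Complete linkage: distance between the two farthest members.
            return max(dist(points[i], points[j]) for i in a for j in b)
        a, b = min(combinations(clusters, 2), key=separation)
        merges.append((a, b, separation((a, b))))
        clusters = [c for c in clusters if c not in (a, b)] + [a + b]
    return merges

merges = complete_linkage(yields)
for left, right, height in merges:
    print(left, "+", right, f"at distance {height:.2f}")
```

Neighboring SFEF fractions merge first and the unfractionated Dagang oil joins last, which is the kind of neighborhood structure that the black boxes in the heat maps are meant to highlight.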
In the two-sided t test, if the p value is greater than the significance level of 0.05, there is no significant difference between the compared items. In correlation analysis, the coefficient R² is a value between 0 and 1. The closer it is to 1, the more similar the two compared items are. The results summarized in Table 7 show a high degree of similarity in the 11 key properties and product yields between the test samples and their respective closest matches in the k-NN and SSOMs models, which supports the accuracy of the two models in classification and product yield distribution prediction.
3.1.2. Hierarchical Cluster Heat Map Analysis. The analysis was performed with the R function heatmap.2 in package gplots for the 11 key properties and FCC product yields of all 38 oil items. The results are displayed in Figures 4 and 5. The maps represent the values of each property or product yield with colors to offer a direct visual impression of the differences and similarities between the oil samples. The dendrogram at the left of each map is the hierarchical cluster graph of the 38 oil samples based on the 11 key properties or FCC product yields. The numbers at the right side of the graphs are the class tags or serial numbers of the samples in each row. Each of the five test samples and its closest matches in the k-NN and SSOMs models are enclosed in one black box. It can be clearly seen from the maps that the best matched training oils are also among the closest cluster neighbors of the test samples, which indicates that the k-NN and SSOMs models can both achieve satisfactory accuracy in classification and product yield estimation.
3.2. Comparison of k-NN and SSOMs Techniques. A brief comparison in concept and application of the two techniques is shown in Table 8. When used for classification purposes, k-NN is a simple method based on the Euclidean distances between oil samples. The closer the distance, the more similar the pair of samples. In comparison to k-NN, SSOMs is a more complex approach. It generates a training map using artificial neural networks to determine the relationship between unknown samples and the training samples. As shown above, both methods could properly classify the test samples. However, the SSOMs model is less stable. The one built in this work was run 1000 times to achieve the optimal classification. Therefore, in practical applications, the k-NN technique would offer a simpler, faster, and more affordable solution than SSOMs for FCC product yield prediction based on unknown oil classification.

Table 8. Comparison of k-NN and SSOMs
method    linear    parameters to optimize    data need to be autoscaled    direct use in high dimensions
k-NN      no        yes                       yes                           yes
SSOMs     no        no                        yes                           yes

3.3. Further Reinforcement for Industrial Applications. The models built in this work cannot be applied directly in refineries until the following three aspects are taken into account. The training set needs to be greatly enlarged to include fairly complete measurements of heavy oil properties, and it needs to include properties beyond the 11 key properties discussed here to accommodate random future oils in refineries. In addition, the variance in reaction conditions should be considered when building the models.
4. CONCLUSION
Using pattern recognition technology (k-NN and SSOMs), an unknown heavy oil can be matched with its closest counterpart among an existing set of oil records in both properties and product yields. The matching will be optimal when the classification models are built with a training set large enough to cover a vast variety of heavy oils. In future heavy oil intelligent processing systems, it would be more cost-effective and efficient to use the k-NN classification model rather than the one built by the SSOMs technique.
■ ASSOCIATED CONTENT
Supporting Information
The 11 key properties and product yields of the 33 training samples and 5 test samples (Table S-1). This material is available free of charge via the Internet at http://pubs.acs.org.
■ AUTHOR INFORMATION
Corresponding Author
*Telephone: +86-10-8973-3392. E-mail: [email protected].
Notes
The authors declare no competing financial interest.
■ ACKNOWLEDGMENTS
This work was supported by the National Basic Research Program of China (2010CB226901) and the Union Fund of the National Natural Science Foundation of China (NSFC) and the China National Petroleum Corporation (CNPC) (U1162204).
■ REFERENCES
(1) Wold, S. Chemometrics; what do we mean with it, and what do we want from it? Chemom. Intell. Lab. Syst. 1995, 30 (1), 109−115.
(2) Wold, S.; Sjöström, M. Chemometrics, present and future success. Chemom. Intell. Lab. Syst. 1998, 44 (1), 3−14.
(3) Lavine, B.; Workman, J. Chemometrics. Anal. Chem. 2006, 78 (12), 4137−4145.
(4) Lavine, B.; Workman, J. Chemometrics. Anal. Chem. 2008, 80 (12), 4519−4531.
(5) Lavine, B.; Workman, J. Chemometrics. Anal. Chem. 2010, 82 (12), 4699−4711.
(6) Parisotto, G.; Ferrão, M. F.; Muller, A. L. H.; et al. Total acid number determination in residues of crude oil distillation using ATR−FTIR and variable selection by chemometric methods. Energy Fuels 2010, 24 (10), 5474−5478.
(7) Morgan, T. J.; Alvarez-Rodriguez, P.; George, A.; et al. Characterization of Maya crude oil maltenes and asphaltenes in terms of structural parameters calculated from nuclear magnetic resonance (NMR) spectroscopy and laser desorption−mass spectroscopy (LD−MS). Energy Fuels 2010, 24 (7), 3977−3989.
(8) Striebich, R. C.; Contreras, J.; Balster, L. M.; et al. Identification of polar species in aviation fuels using multidimensional gas chromatography−time of flight mass spectrometry. Energy Fuels 2009, 23 (11), 5474−5482.
(9) de Peinder, P.; Visser, T.; Wagemans, R.; et al. Sulfur speciation of crude oils by partial least squares regression modeling of their infrared spectra. Energy Fuels 2009, 24 (1), 557−562.
(10) Nielsen, K. E.; Dittmer, J.; Malmendal, A.; et al. Quantitative analysis of constituents in heavy fuel oil by 1H nuclear magnetic resonance (NMR) spectroscopy and multivariate data analysis. Energy Fuels 2008, 22 (6), 4070−4076.
(11) Bodle, E. S.; Hardy, J. K. Multivariate pattern recognition of petroleum-based accelerants by solid-phase microextraction gas chromatography with flame ionization detection. Anal. Chim. Acta 2007, 589 (2), 247−254.
(12) Kim, M.; Lee, Y.-H.; Han, C. Real-time classification of petroleum products using near-infrared spectra. Comput. Chem. Eng. 2000, 24 (2−7), 513−517.
(13) Balabin, R. M.; Safieva, R. Z.; Lomakina, E. I. Gasoline classification using near infrared (NIR) spectroscopy data: Comparison of multivariate techniques. Anal. Chim. Acta 2010, 671 (1−2), 27−35.
(14) Hur, M.; Yeo, I.; Kim, E.; et al. Correlation of FT-ICR mass spectra with the chemical and physical properties of associated crude oils. Energy Fuels 2010, 24 (10), 5524−5532.
(15) Barbeira, P. J. S.; Pereira, R. C. C.; Corgozinho, C. N. C. Identification of gasoline origin by physical and chemical properties and multivariate analysis. Energy Fuels 2007, 21 (4), 2212−2215.
(16) Katritzky, A. R.; Fara, D. C. How chemical structure determines physical, chemical, and technological properties: An overview illustrating the potential of quantitative structure−property relationships for fuels science. Energy Fuels 2005, 19 (3), 922−935.
(17) Clark, H. A.; Jurs, P. C. Classification of crude oil gas chromatograms by pattern recognition techniques. Anal. Chem. 1979, 51 (6), 616−623.
(18) Pasadakis, N.; Kardamakis, A. A.; Sfakianaki, P. Classification of gasoline grades using compositional data and expectation−maximization algorithm. Energy Fuels 2007, 21 (6), 3406−3409.
(19) Fonseca, A. M.; Biscaya, J. L.; Aires-de-Sousa, J.; et al. Geographical classification of crude oils by Kohonen self-organizing maps. Anal. Chim. Acta 2006, 556 (2), 374−382.
(20) El-Gayar, M. S.; Mostafa, A. R.; Abdelfattah, A. E.; et al. Application of geochemical parameters for classification of crude oils from Egypt into source-related types. Fuel Process. Technol. 2002, 79 (1), 13−28.
(21) Johnson, K. J.; Morris, R. E.; Rose-Pehrsson, S. L. Evaluating the predictive powers of spectroscopy and chromatography for fuel quality assessment. Energy Fuels 2006, 20 (2), 727−733.
(22) Johnson, K. J.; Rose-Pehrsson, S. L.; Morris, R. E. Monitoring diesel fuel degradation by gas chromatography−mass spectroscopy and chemometric analysis. Energy Fuels 2004, 18 (3), 844−850.
(23) Morris, R. E.; Hammond, M. H.; Cramer, J. A.; et al. Rapid fuel quality surveillance through chemometric modeling of near-infrared spectra. Energy Fuels 2009, 23 (3), 1610−1618.
(24) Morris, R. E.; Hammond, M. H.; Shaffer, R. E.; et al. The application of chemometric methods to correlate fuel performance with composition from gas chromatography. Energy Fuels 2004, 18 (2), 485−489.
(25) Skrobot, V. L.; Castro, E. V. R.; Pereira, R. C. C.; et al. Identification of adulteration of gasoline applying multivariate data analysis techniques HCA and KNN in chromatographic data. Energy Fuels 2005, 19 (6), 2350−2356.
(26) Wongravee, K.; Lloyd, G. R.; Silwood, C. J.; et al. Supervised self organizing maps for classification and determination of potentially discriminatory variables: Illustrated by application to nuclear magnetic resonance metabolomic profiling. Anal. Chem. 2009, 82 (2), 628−638.
(27) Zhao, S.; Xu, Z.; Xu, C.; et al. Systematic characterization of petroleum residua based on SFEF. Fuel 2005, 84 (6), 635−645.
(28) Zhang, Z. G.; Guo, S.; Zhao, S.; et al. Alkyl side chains connected to aromatic units in Dagang vacuum residue and its supercritical fluid extraction and fractions (SFEFs). Energy Fuels 2008, 23 (1), 374−385.
(29) Skrobot, V. L.; Castro, E. V. R.; Pereira, R. C. C.; et al. Use of principal component analysis (PCA) and linear discriminant analysis (LDA) in gas chromatographic (GC) data in the investigation of gasoline adulteration. Energy Fuels 2007, 21 (6), 3394−3400.