Energy & Fuels 2007, 21, 2955-2963
2955
Comparison of Probability Distribution Functions for Fitting Distillation Curves of Petroleum Sergio Sa´nchez,† Jorge Ancheyta,*,†,‡ and William C. McCaffrey§ Instituto Mexicano del Petro´ leo, Eje Central La´ zaro Ca´ rdenas 152, D.F. 07330, Me´ xico, Escuela Superior de Ingenierı´a Quı´mica e Industrias ExtractiVas (ESIQIE-IPN), D.F. 07738, Mexico, and Department of Chemical and Materials Engineering, UniVersity of Alberta, Edmonton, Alberta, T6G 2G6 Canada ReceiVed January 2, 2007. ReVised Manuscript ReceiVed June 14, 2007
The fitting capability of 25 probability distribution functions for distillation data of petroleum fractions was analyzed in this work. Rankings of all the functions based on two different approaches were established after a statistical analysis of the fit of the functions with a data set of 137 distillation curves. In general, distribution functions with four parameters showed better fitting capability than those with three parameters. Two-parameter functions were not effective in fitting distillation data. The Weibull extreme, Kumaraswamy, and Weibull functions were found to be the best distribution functions for fitting distillation data considering their ranking and the required CPU time. These distribution functions exhibited the lowest Akaike information criterion and Bayesian information criterion average values, standard deviations lower than 1%, correlation coefficients higher than 0.999, and residuals randomly distributed without any tendency. The fitting capability of the best functions was validated with an independent set of distillation data, and the ranking was the same.
Introduction Specialized characterization of petroleum has been the topic of a number of research papers, which can be from the wellknown and used “assays” to sophisticated studies based on upto-date laboratory techniques, for example, NMR, MS, XRD, and so forth. Since the early 1930s, several studies have discussed the implementation of more accurate characterization methods.1 While sophisticated approaches to characterize petroleum have enhanced our understanding of its structure, traditional characterization methods are still widely employed. Empirical correlations are also very popular to estimate “bulk” properties of petroleum fractions, which are mainly based on distillation curves and specific gravities. These correlations are very useful in process engineering, particularly when the available experimental data are limited.2 In the case of distillation curves, sometimes a limited amount of distillation points are available and it is necessary to interpolate/extrapolate to determine a required value. Several American Society for Testing and Materials (ASTM) methods are commonly used to obtain distillation data: ASTM D-5307, ASTM D-2892, ASTM D-1160, ASTM D-86, and so forth. All of them employ standardized devices and report boiling point temperatures of the sample versus distillation yields on a volumetric and/or a gravimetric basis. Since distillation curves are formed with a finite number of temperature-yield data, they can be fitted to different functions in order to generate a continuous representation. Polynomial regression, cubic spline interpolation, and Lagrange interpolation are all common mathematical tools which * To whom correspondence should be addressed. Fax: (+52-55) 91758418. E-mail:
[email protected]. † Instituto Mexicano del Petro ´ leo. ‡ ESIQIE-IPN. § University of Alberta. (1) Watson, K. M.; Nelson, E. F. Ind. Eng Chem. 1933, 25, 880. (2) Technical data book - petroleum refining; American Petroleum Institute: Washington, DC, 1976.
have been used for interpolating points along the distillation curve. Another approach, which offers more accurate adjustments, is the use of least-square methods for fitting probability distribution functions to distillation data. Probability distribution functions have been utilized in a wide variety of applications: in process simulation software during the creation of pseudocomponents, which are used together with quadrature techniques for determining the optimal number of pseudocomponents for simulation purposes, for example, characterization of petroleum fractions;3-5 for phase equilibrium calculations when continuous thermodynamics methods are applied;6-8 and to describe the extent of chemical transformations occurring during petroleum refining processes, which is a relatively recent application of the probability distribution functions.9,10 Other examples include the description of polymerization reaction products, which are mixtures of compounds with different molecular weights and use in environmental studies for representing the size distribution of particles of dust and aerosols, the rain precipitation per day, the level of rivers and lakes, and other meteorological phenomena.11 Distillation data and specific gravity are the most common properties used as inputs in empirical correlations to characterize petroleum fractions. This characterization is achieved by means of correlations that are useful for determining molecular weight, critical properties, and so forth. They can also be utilized for distinguishing reaction products as pseudocomponents or lumps (naphtha, middle distillates, etc.) of some typical refinery (3) Whitson, C. H. SPE J. 1983, 275, 683. (4) Whitson, C. H.; Anderson, T. F.; Soreide, I. NPRA, New Orleans, LA, March 6-10, 1988. (5) Dhulesia, H. Hydrocarbon Process. 1984, 62, 179. (6) Kehlen, H.; Ratzsch, M. T. Chem. Eng. Sci. 1987, 42, 221. (7) Willman, B.; Teja, A. S. Ind. Eng. Chem. Res. 1986, 26, 948. (8) Peng, D. Y.; Wu, J. P.; Batycky, J. P. AOSTRA J. Res. 1987, 3, 113. (9) Bacaud, R.; Rouleau, L; Bacaud, B. Energy Fuels 1996, 10, 915. (10) Krishna, R.; Saxena, A. K. Chem. Eng. Sci. 1989, 44, 703. (11) Kumaraswamy, P. J. Hydrol. 1980, 46, 79.
10.1021/ef070003y CCC: $37.00 © 2007 American Chemical Society Published on Web 08/08/2007
2956 Energy & Fuels, Vol. 21, No. 5, 2007
processes such as hydrocracking, catalytic cracking, and so forth.12 To have accurate and reliable representations of distillation data for further interpolation, a strict analysis of other approaches apart from the traditional interpolation techniques is mandatory. The main objective of the present work is to contrast the fitting capability of several probability distribution functions to distillation data. Brief Background of Probability Distribution Functions Probability distribution functions were first developed to measure the possibility of the occurrence of a specific event. Depending on the number of possible events, they can be discrete functions, when the number of possible events is discrete, or they can be classified as continuous functions. In the present contribution, only continuous distribution functions are studied. Probability distribution functions are defined for a reduced number of parameters and have simple formulas for calculating mean, mode, variance, and so forth. They have two main forms: the probability density function (PDF) and the cumulative distribution function (CDF). The former is the most commonly used form of the probability distribution functions, a very well-known example of this type of function is the classical Gaussian bell.13 CDFs increase monotonically and generally describe the same behavior that is observed with distillation curves. Due to their simplicity, probability distribution functions are easily included in computer programs for modeling, optimization, and control purposes. It is not recommended to select the distribution function a priori, however, since finding an adequate distribution function that represents the experimental data with minimal error depends on the specific characteristics of each type of function. As was mentioned in the Introduction, several distribution functions have been utilized for calculations related to the petroleum industry. Whitson et al.3,4 proposed a petroleum characterization method based on the three-parameter γ distribution function for the characterization of C7+. Dhulesia5 proposed the Weibull distribution function in its cumulative form to describe ASTM distillation curves of petroleum fractions. The Weibull equation was tested with distillation data of the feed and products of a fluid catalytic cracking unit, and the fitted curves showed good agreement with experimental data. The normal function has been employed in phase equilibrium methods when the continuous thermodynamics approach was used.6 The normal function and the error function were utilized for modeling reaction behavior of the hydrocracking process.9,10 Willman and Teja7 used the bivariate log-normal distribution function for characterizing the composition of mixtures involved during phase equilibrium calculations. The β distribution function has been employed for characterizing petroleum fractions in state equation calculations.8 Exponential and χ2 distribution functions, which are simplified cases of the γ distribution function, have been used to characterize the heavy end of reservoir fluids and to develop phase equilibrium computations.14,15 A modified form of the Weibull distribution was utilized by Riazi16 for establishing a method to predict complete property distributions for the molecular weight, boiling (12) Ancheyta, J.; Sa´nchez, S.; Rodrı´guez, M. A. Catal. Today 2005, 109, 76. (13) Evans, M.; Hastings, N.; Peacock, B. Statistical distributions; John Wiley: New York, 1993. (14) Behrens, R. A.; Sandler, S. I. SPE Res. Eng. 1998, 3, 1041. (15) Luks, K. D.; Turek, E. A.; Kragas, T. R. Ind. Eng. Chem. Res. 1990, 29, 2101. (16) Riazi, M. R. Ind. Eng. Chem. Res. 1989, 28, 1731.
Sa´ nchez et al. Table 1. Probability Distribution Functions Used in This Worka
eq.
function
number of parameters
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
R (normalized) R β Bradford Burr χ fatigue life Fisk Fre`chet folded normal γ generalized extreme value generalized logistic Gumbel half normal Jhonson SB Kumaraswamy log-normal Nakagami normal Riazi Student's t Wald Weibull Weibull extreme
2 4 4 3 4 3 3 3 3 2 3 3 3 2 2 4 4 2 3 2 3 3 2 3 4
a
other functions included PDF
CDF
ref
Φ Φ Γ
Φ Φ I
Γ
Γ Φ
18 18 42 42 42 42 18 42 17 42 42 17 42 17 42 43 11 42 42 42 16 42 42 42 18
Γ
Φ Γ
Φ Γ Γ
Φ Γ Φ I Φ
Φ: CDF normal. Γ: γ function. Ι: incomplete β function.
point, specific gravity, and refractive index of C7+ fractions. The extreme value distributions are a family of equations which have been used in applications involving natural phenomena such as rainfall, floods, air pollution, and corrosion. Gumbel, Fre`chet, and Weibull functions may all be represented as members of a single family of generalized extreme value distribution functions.17 The Kumaraswamy distribution is comparable in characteristics with the β distribution function for their versatility and double-bounded nature; however, the Kumaraswamy distribution has a simpler form than the β distribution for both the PDF and the CDF.11 Even though the Kumaraswamy distribution function was originally developed for hydraulic modeling, it has been applied to describe bounded physical variables encountered in civil engineering. Among all distribution functions found in the literature, only 25 were chosen to be analyzed in this work. The selected distributions are summarized in Table 1. Not included in the list are a few well-known distributions, such as the TukeyLambda, Cauchy, and F distributions,18 because they are either seldom used to model empirical data or they lack a convenient analytical form for the CDF. Even though most of the distributions reported in Table 1 are not widely applied to engineering applications, they all have the potential to be very useful for describing real-world data sets. Definitions of the probability distribution functions in their two main forms (CDF and PDF) are presented in Table 2. The forms of the functions may vary slightly from those reported in the literature. It is also possible that a distribution function could be known with different names. Methodology Data Source. The fitting capability of the 25 selected functions was done using two distillation data sources: those previously (17) Kotz, S.; Nadarajah, S. Extreme Value distributions: theory and applications, Imperial College Press: London, 2000. (18) Heckert, N. A.; Filliben, J. NIST handbook 148; NIST: Gaithersburg, MD, 2003.
Comparison of Probability Distribution Functions
Energy & Fuels, Vol. 21, No. 5, 2007 2957
Table 2. Definitions of Continuous Probability Distribution Functionsa eq 1
probability density function (PDF)
distribution R (normalized) (C, D)
D y Φ(C)x2π 2
2
R (A, B, C, D)
[ 21 (C - Dy) ]
D
β (A, B, C, D)
Bradford(A, B, C)
5
Burr (A, B, C, D)
6
χ (A, B, C)
7
9
Fre´chet (A, B, C)
10
12
generalized extreme value (A, B, C)
13
generalized logistic (A, B, C)
14
Gumbel (A, B)
15
Johnson SB (A, B, C, D)
1-Γ
[x x (
xπ2 cosh(AyB) exp( - 21 y +B A ) 2
2
19
Nakagami (A, B, C)
20
normal (A, B)
Riazi (A, B)
0.5
C
(-yB- A)
Γ(C,t) exp[-(1 + Ct)-1/C]
1 [1 + exp(-t)]c exp[-exp(-t)]
1 exp(-t) exp[-exp(-t)] B
exp
{
y-A C + D ln( [ B - y)] 2
(By -- AA) [1 - (By -- AA) ] C-1
C D-1
}
2
[
Φ C + D ln
( By -- Ay)]
[ (By -- AA) ]
C D
1- 1-
2Φ(t) - 1
x ( ) 2 1 exp - t2 π 2
[ (
)]
1 ln(y) - A 1 exp 2 B Byx2π 2CC 2C-1 t exp(-Ct2) BΓ(C)
B2 B-1 B y exp - yB A A
(
2
(
Φ
)
ln(y) - A B
Γ(C,Ct2)
Φ(t)
1 1 exp - t2 2 x B 2π
( )
21
0.5
Φ(t) - Φ
exp(-t) C B [1 + exp(-t)]c+1
1 B log-normal (A, B)
2
1 (1 + Ct)-1-(1/C) exp[-(1 + Ct)-1/C] B
half normal (A, B)
]
y A -( ) ( A) y Φ
2
1 C-1 t exp(-t) ΒΓ(C)
Kumaraswamy (A, B, C, D)
[
2
exp(-t-C)
C -C-1 t exp(-t-C) B
CD
18
-1
( C2 ,21t )
1 1 + t-C
(y - A)(B - y)x2π
17
)]
y A + -2 A A y exp y 2C2
C tC-1 B (1 + tC)2
D(B - A) 16
]
C(y - A) B-A ln(C + 1)
(1 + t -C)-D
( ) ()
1 B γ (A, B, C)
[
ln 1 +
1 tC-1 exp - t2 2 C (2(C/2)-1B)Γ 2
folded normal (A, B)
11
(By -- AA,C,D)
CD -C-1 t (1 + t-C)-D-1 B
y A
)
I
(B - y)1-D
C [C(y - A) + B - A] ln(C + 1)
y2 - A2 2x2πC2By2 Fisk (A, B, C)
(
Φ C-
(y - A)C-1 C+D-1
)
D t Φ(C)
2
Γ(C + D)
fatigue life (A, B, C)
8
(
Φ C-
[ 21(C - Dt) ]
exp -
Γ(C) Γ(D) (B - A) 4
D y Φ(C)
2
exp -
B2t2Φ(C)x2π 3
cumulative distribution function (CDF)
)
B 1 - exp - yB A
(
)
2958 Energy & Fuels, Vol. 21, No. 5, 2007
Sa´ nchez et al.
Table 2 (Continued) eq
distribution
22
Student’s t (A, B, C)
probability density function (PDF)
1 BxπC 23
a
Wald (A, B)
24
Weibull (A, B, C)
25
Weibull extreme (A, B, C, D)
x
Γ
(C +2 1) 1 + t C ( C) Γ( ) 2
2 -[(C+1)/2]
B y-A B exp 2y A 2πy3
[ (
2
)]
C C-1 exp(-tC) t B CDtC-1 exp(-tC)[1 - exp(1 - tC)]D-1 B
{
cumulative distribution function (CDF)
(
)
1 C C1 I , , ,te0 2 C + t2 2 2 1 C C1 1- I , , ,t>0 2 C + t2 2 2
Φ
(
(x
)
)
( ) (x
By - A 2B + exp Φ y A A
)
B-y - A y A
1 - exp(-tC) [1 - exp(-tC)]D
t ) (y - A)/B
reported in the literature19-37 and data obtained in laboratories at the Mexican Institute of Petroleum and the University of Alberta. The selected samples include whole crude oils; vacuum gasoils; atmospheric and vacuum residua; atmospheric gasoils; light cycle oils (LCO); hydrotreated LCO; feeds of the fluid catalytic cracking (FCC) process; feeds; and products of mild thermal processing, vacuum residue hydrotreating, hydrotreating of bitumen-derived gasoils process, and hydrotreating of middle distillates. The distillation data set was comprised of petroleum samples mainly from Kuwait, Saudi Arabia, Mexico, and Canada. A total of 137 distillation curves were considered in the analysis, each having at least six experimental points, with a total of 1627 temperatureversus-yield points. All experimental distillation data were obtained using standardized methods (physical distillation methods: ASTM D-2892 and ASTM D-1160; simulated distillation methods: ASTM D-5307, ASTM D-6352, and high-temperature simulated distillation) and were collected in a database. Distillation data were not reduced at a unique basis; instead, they were treated on their original basis (ASTM D-2892, ASTM D-1160, ASTM D-5307, etc.). Different units of temperature and product recovery were found in the literature and were transformed to degrees Celsius and weigth percent. Example of Parameter Estimation. The comparison of fitting capability of all functions reported in Tables 1 and 2 was performed by statistical methods. The procedure for parameter estimation is described below, and the four-parameter β-distribution function (eq 3) is taken as an example using a single distillation data set, which corresponds to a simulated distillation curve of hydrocracked Maya crude oil: (19) Ali, F.; Ghaloum, N.; Hauser, A. Energy Fuels 2006, 20, 45. (20) Anabtawi, J. A.; Ali, S. A. Ind. Eng. Chem. Res. 1991, 30, 2586. (21) Aoyagi, K.; McCaffrey, W.; Gray, M. R. Pet. Sci. Technol. 2003, 21, 997. (22) Barman, B. N. Energy Fuels 2005, 19, 1995. (23) Bollas G. M.; Vasalos, I. A.; Lappas, A. A.; Iatridis, D. K.; Tsioni, G. K. Ind. Eng. Chem. Res. 2004, 43, 3270. (24) Chen, Y. W.; Tsai, M. C. Ind. Eng. Chem. Res. 1997, 36, 2521. (25) Espinosa, M.; Figueroa, Y.; Jimenez, F. Energy Fuels 2004, 18, 1832. (26) Laredo, G. C.; Lo´pez, C. R.; Alva´rez, R. E.; Castillo, J.; Cano, J. L. Energy Fuels 2004, 18, 1687. (27) Lenoir, J. M.; Hipkin H. G. J. Chem. Eng. Data 1973, 18, 195. (28) Marafi, A.; Al-Bazzaz, H.; Al-Marri, M.; Maruyama, F.; AbsiHalabi, M.; Stanislaus, A. Energy Fuels 2003, 17, 1191. (29) Marroquin, G.; Ancheyta, J.; Ramı´rez, A.; Farfan, E. Energy Fuels 2001, 15, 1213. (30) Maw, S. C.; Heldman, J. L.; Hwang, S. C.; Tsonopoulos, C. Ind. Eng. Chem. Process Des. DeV. 1984, 23, 577. (31) Michael, G.; Al-Siri, M.; Khan, Z. H.; Ali, F. A. Energy Fuels 2005, 19, 1598. (32) Owusu-Boakye, A.; Dalai, A. K.; Ferdous, D.; Adjaye, J. Energy Fuels 2005, 19, 1763. (33) Rousis, S. G.; Fitzgerald, W. P. Anal. Chem. 2000, 72, 1400. (34) Sarma, A. K.; Konwer, D. Energy Fuels 2005, 19, 1755. (35) Schwartz, H. E.; Brownlee, R. G.; Boduszinski, M. M.; Su, F. Anal. Chem. 1987, 59, 1393. (36) Ukwuoma, O. Pet. Sci. Technol. 2002, 20, 525. (37) Yui, S. M.; Ng, S. H. Energy Fuels 1995, 9, 665.
Step 1. Temperature data were changed to a dimensionless form using the following equation: θi )
Ti - T 0 T1 - T0
(26)
where θi is the dimensionless temperature, Ti is the actual temperature boiling point, and T0 and T1 are reference temperatures, which are chosen to have θi values between 0 and 1 (T0 ) 30 °C and T1 ) 1000 °C in this work). The dimensionless distillation curve together with the original distillation data of a selected sample are shown in Figure 1. In the case of dimensionless distillation data, the values neither start at zero nor finish at one since the reference temperatures, T0 and T1, covered a wider range of values than those of the selected sample. Step 2. An optimization method38 was applied for obtaining the optimal set of parameters of the probability distribution function. The optimization criterion was the minimization of the residual sum of squares (RSS) defined by eq 27. RSS )
∑(y
exp,i
- ycal,i)2
(27)
where yexp,i and ycal,i are the experimental and calculated weight fractions, respectively. The optimal set of parameters using a β distribution function for the data given in Figure 1 was A ) 0.089 62, B ) 1.050 13, C ) 2.490 03, and D ) 6.341 86. To be sure about the precision of the estimated parameters and convergence to a global minimum, a sensitivity analysis was conducted using an approach reported elsewhere.39 Step 3. Predicted values of liquid recovery using the distribution function with the optimal values of parameters were obtained. The results from the model are also shown in Figure 1.
Figure 1. Experimental (O) and predicted (__) distillation values with β function (hydrocracked Maya crude oil).
Comparison of Probability Distribution Functions
Energy & Fuels, Vol. 21, No. 5, 2007 2959 a penalty term (2k) that is an increasing function of the number of parameters; this feature makes it very useful for the comparison of models with different numbers of parameters. In this study, the preferred probability distribution function will be that with the lowest AIC value. The expression to calculate the Bayesian information criterion for models with randomly distributed residuals is BIC ) kln(n) + nln
Figure 2. Comparison of experimental and predicted values using β function for the testing data set.
Step 4. A statistical analysis and residuals analysis using predicted and experimental values were carried out in order to identify possible model errors. For data of the example, values of maximum absolute error, average absolute error, RSS, and standard deviation (SD) were 1.19, 1.36, 5.34, and 0.56, respectively. Step 5. In most of the cases, the largest errors were found at the extreme points of the distillation curves. This can be due to the low accuracy of the experimental measurements in these parts of the curve. That is why, for practical purposes, initial boiling point (IBP) and final boiling point (FBP), or even 5% and 95% distillation points, are commonly excluded from calculations. Parameter Estimation for All Distribution Functions. The procedure previously described was employed for the parameter estimation of all distribution functions given in Table 2 for each of the 137 distillation data sets. As an example, comparisons of experimental and predicted values from the β-distribution function are shown in Figure 2, in which a visual analysis and predictive capability of the function can be established. Correlation coefficient (R2), slope, intercept, and standard deviation were obtained by statistical analysis of the parity plot of experimental versus calculated values of liquid recovery. A summary of the statistical parameters derived from a regression analysis of the parity plot is presented in Table 3. This table gives more quantitative analysis of the predictive capability of the β function. The predictive capability of the different functions was classified according to their statistical indicators. First, a methodology based on regression analyses was applied, in which SD as calculated by eq 28 was the main criteria for establishing the ranking; the correlation coefficient (R2), slope, and intercept were also considered. SD )
xnRSS -2
(28)
A second approach took into consideration both the Akaike and Bayesian information criteria (AIC and BIC, respectively). The AIC is an operational way of considering both the complexity of a model and how well it fits the data.40 The AIC methodology attempts to find the model that best explains the data with a minimum of free parameters. When residuals are randomly distributed, the AIC is calculated as AIC ) 2k + nln
(RSS n )
(29)
where k is the number of parameters, n is the number of observations, and RSS is the residual sum of squares. AIC includes (38) Marquardt, D. W. J. Soc. Ind. Appl. Math. 1963, 2, 431. (39) Alcazar, L. A.; Ancheyta, J. Chem. Eng. J. 2007, 128, 85. (40) Burnham, K. P.; Anderson, D. R. Model selection and multimodel inference, 2nd ed.; Springer-Verlag: New York, 1998.
(RSS n )
(30)
Compared to the AIC, the BIC penalizes free parameters more strongly. In the same way as that using the AIC, the model with the lowest value of BIC is the one that is preferred. Since AIC is strongly dependent on sample size, it is recommended to use relative values and, particularly, the AIC differences (∆i, given by eq 31) for selecting a model. Models with ∆i > 10 may be considered as failing to explain a substantial variation in the data and may be omitted from further consideration. ∆i ) AICi - AICmin
(31)
Alternatively, Burnham and Anderson40 proposed the use of AIC weights (ωi) for model selection, which is considered as evidence that model i is the best model for a given situation among all models. The evidence ratios (ω1/ωj) are used to compare two different models, where model 1 is the estimated best model and j indexes the rest of the models in the set. AIC weights and evidence ratios are calculated by
ωi )
( ) ( )
1 exp - ∆i 2 R
∑ r)1
(32)
1 exp - ∆r 2
( )
ω1 1 ) exp ωj 2∆j
(33)
Results and Discussion To determine the best distribution function to describe distillation curves, the various functions were first fitted to the experimental data. The largest errors during data fitting were obtained with boiling points close to the end of the distillation curves followed by those close to the beginning. This problem was particularly evident for the normal and Student’s t distribution functions, which are symmetrical. The difficulty in fitting distribution functions to IBP and FBP data is compounded by the larger experimental error associated with the endpoints of distillation curves. These errors are associated with the sensitivity of experimental devices when initializing or finalizing the tests and are observed regardless of whether the equipment is operated manually or automatically. The experimental error is variable, depending upon the standardized method that is employed; for instance, in the ASTM D-2892 method, errors up to 1.2 wt % for the volume recovery are tolerated, whereas in the ASTM D-1160 method, errors can range from 1.7 up to 5.7 wt % for the different points in the distillation curve. Selecting the “best” distribution function is not a trivial task. A wide variety of statistical data can be used in this duty, including standard deviations, R2, Akaike and Bayesian information criteria, and even CPU time, which are all presented in Table 4. It is well-accepted that correlation coefficients are not very useful in discriminating between models. In this study, the correlation coefficients were very close to unity (0.9860.999) for all of the functions. To highlight this point, only the R distribution function (eq 1) exhibited a value of R2 lower than
2960 Energy & Fuels, Vol. 21, No. 5, 2007
Sa´ nchez et al.
Table 3. Main Statistical Parameters from the Regression Analysis of β Distribution Regression statistics correlation coefficient, R2 standard deviation, wt % observations
slope intercept analysis of variance regression residual total
0.999 0.814 1474
coefficients
lower 95%
upper 95%
1.002 -0.0655
1.0033 0.0062
1.0002 -0.1372
Df
sum of squares
mean square
1 1473 1474
1.0961 × 973.66 1.0971 × 106
1.0961 × 0.6619
106
106
F 1.6560 × 106
Table 4. Ranking of All Distribution Functions equation
R2
slope
intercept, wt %
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
0.986 0.998 0.999 0.996 0.995 0.998 0.995 0.998 0.996 0.995 0.998 0.996 0.998 0.995 0.993 0.998 0.999 0.996 0.997 0.994 0.996 0.994 0.995 0.999 1
1.023 1.001 1.002 1.007 1.026 1.001 1.014 0.998 1.002 0.987 1.003 1.002 0.996 1.000 1.016 1.001 1.004 1.012 1.000 0.986 0.997 0.984 1.015 1.002 1.003
-1.47 0.08 -0.01 -0.37 -0.60 0.06 -0.71 0.22 -0.17 0.90 -0.09 -0.17 0.37 -0.07 -0.92 0.02 -0.05 -0.61 0.22 0.91 0.45 1.00 -0.79 0.01 -0.06
a
SD, wt %
SD-based ranking
average AIC
average BIC
AIC-based ranking
CPU timea
3.374 1.276 0.814 1.778 1.992 1.182 1.911 1.264 1.801 1.876 1.072 1.797 1.132 1.84 2.397 1.126 0.885 1.849 1.393 2.093 1.673 2.064 1.987 0.865 0.599
25 10 2 13 21 8 19 9 15 18 5 14 7 16 24 6 4 17 11 23 12 22 20 3 1
31.79 3.62 -14.77 10.66 -0.41 -4.04 19.02 4.76 7.96 15.34 -5.87 7.67 2.54 7.78 21.65 -3.90 -13.19 15.55 -0.49 18.34 2.03 19.66 18.20 -10.44 -16.61
32.57 5.19 -13.20 11.84 1.15 -2.86 20.19 5.93 9.14 16.12 -4.69 8.84 3.72 8.56 22.43 -2.34 -11.62 16.34 0.69 19.12 3.21 20.83 18.98 -9.26 -15.04
25 12 2 17 9 6 22 13 16 18 5 14 11 15 24 7 3 19 8 21 10 23 20 4 1
2.267 2.143 3.705 0.219 0.124 1.924 0.724 0.114 0.143 1.257 3.400 0.124 0.114 0.086 0.657 0.771 0.219 0.838 2.086 1.000 0.124 4.381 1.305 0.124 0.248
CPU time relative to that required with normal distribution function (eq 20).
0.99. A parity plot is presented in Figure 2, and the slopes and intercepts of the parity plots for each model are included in Table 4. The slopes of the experimental versus predicted values plots are in all cases virtually unity (0.984-1.026), and intercepts range between -1.47 and +1.00. A more useful technique to eliminate potential models was to identify which distributions yielded nonrandom residuals. Nearly all of the twoparameter models had trends in their residuals. The twoparameter models that were eliminated due to trends in the residuals were normalized R (eq 1), Fre`chet (eq 9), folded normal (10), half normal (eq 15), log-normal (eq 18), normal (eq 20), Student’s t (eq 22), and Wald (eq 23). Additionally, the three-parameter models that were eliminated due to trends in their residuals were the fatigue life (eq 7) and the generalized extreme value (eq 12) models. Interestingly, the Gumbel distribution, eq 14, was the only two-parameter model to display random residuals. Not surprisingly, all of the four-parameter models were effective in describing the experimental data. For comparison purposes, Figure 3 presents the residuals analysis for the worst (two-parameter R function, eq 1) and best (fourparameter Weibull extreme function, eq 25) functions. The differences and precision of estimations are very clear; while residuals for eq 25 ranged between -5 and +5 and were randomly distributed, those for eq 1 varied from a -15 to +10 with a very pronounced trend.
One method to rank the models is to compare the standard deviations. Since the standard deviation values for all functions ranged from 0.59 to 3.37 wt %, they were more useful than the R2 values or the slopes and intercepts from the parity plots. From the results given in Table 4 (R2, slope, intercept, and SD) and residual analysis, the following classification of accuracy of predictions was established: Group 1: SD < 1.0; 0.999 < R2 < 1; number of functions ) 4 (eqs 3, 17, 24, and 25) Group 2: 1.0 < SD < 1.992; 0.995 < R2 < 0.998; number of functions ) 11 (eqs 2, 4, 5, 6, 8, 9, 11, 13, 14, 16, 19, and 21) Group 3: trends in residuals; number of functions ) 10 (eqs 1, 7, 10, 12, 15, 18, 20, 22, and 23) Functions of group 1 are the most accurate, and those of group 3 do not adequately describe the functionality of the distillation data. Model selection should be based not solely on goodness of fit but also on the degree of confidence of the predicted parameters. It is well-known that increasing the number of free parameters to be estimated can improve the goodness of fit but can also decrease the confidence in the estimates of the model parameters. Therefore, ranking the models solely on the basis of SD data may not be satisfactory for comparing functions with different numbers of parameters and different sample sizes. In
Comparison of Probability Distribution Functions
Energy & Fuels, Vol. 21, No. 5, 2007 2961
Figure 4. Average AIC vs sample size. The light-colored bars represent the γ function, whereas the dark-colored bars represent the Weibull extreme function.
Figure 3. Residual plot for (a) normalized R and (b) Weibull extreme distribution functions.
this work, AIC and BIC were used to take into account the different numbers of parameters in the probability distribution functions when ranking the best functions to use to describe distillation data. A value of AIC was calculated for each set of data for every function, and a new ranking was determined using the average AIC of each function. A similar procedure was also applied for calculating an average BIC value and to rank each model. Both the AIC and BIC values are presented in Table 4. Even though the BIC penalizes functions with more free parameters, the BICbased ranking was very similar to the AIC-based ranking, with only the Gumbel and generalized extreme value functions exchanging places in the ranking. It should be noted that the four best-ranked functions using AIC, the Weibull extreme, β, Kumaraswamy, and Weibull functions, are the same as those identified by the SD-based rankings (group 1). The only difference between the top functions in the AIC- and SD-based rankings is in the order. The question remains if there is a significant difference in the ability of the different probability distribution functions to describe the distillation data. It is generally not recommended to apply null hypothesis testing to information-theoretic ranking data to determine if the “best” model is significantly better than
any of the lower-ranked models.40 Model selection is best achieved through an inspection of evidence ratios and residuals. A summary of the AIC and evidence ratios of the 10 best-ranked functions is presented in Table 5. It can be seen that the differences among the four best-ranked functions (∆i from 1.84 to 6.17) are not high enough to conclude unambiguously that there is a single best model from the top four ranked models. Since the ∆i values of the models are fairly similar, variation in the selection of the best model is expected from data set to data set. In the case of distillation data, a priori selection of the model is not recommended, and instead the Weibull extreme, β, Kumaraswamy, and Weibull functions should all be evaluated. A similar conclusion can be reached using the information given by the evidence ratios for the four best-ranked functions, but the high value of the evidence ratio for the Weibull function (wj/w1 ) 21.9) makes it very unlikely that this model was the best. The distillation data sets used in this study ranged in size from 6 to 19 data points in each set. In order to examine if sample size had any impact on model selection, AIC data versus the sample size were plotted for each function. Figure 4 shows the results for the Weibull extreme (rank 1) and γ (rank 5) functions. Two clearly defined groups are formed, both with similar general trends (sample size of 6-10 and sample size of 11-19). The average AIC values for each group for each function were calculated and ordered. The new ranking changed only slightly. Importantly, the group of the four best-ranked functions remained unchanged. The CPU time spent during the parameter optimization process for each function was calculated for each function. The results are included in Table 4 as relative values of computing time with respect to that required for the normal distribution function. Focusing on the functions of group 1 (ranks 1-4), it can be seen that the Weibull extreme (eq 25), Kumaraswamy
Table 5. ∆i and Evidence Ratios of the Best 10 Ranked Functions AIC-based ranking
function (parameters)
eq
average AIC
∆i
ωi
ωj/ω1
1 2 3 4 5 6 7 8 9 10
Weibull extreme (4) β (4) Kumaraswamy (4) Weibull (3) γ (3) χ (3) Jhonson SB (4) Nakagami (3) Burr (4) Riazi (3)
25 3 17 24 11 6 16 19 5 21
-16.61 -14.77 -13.19 -10.44 -5.87 -4.04 -3.90 -0.49 -0.41 2.03
0 1.84 3.42 6.17 10.75 12.58 12.71 16.12 16.20 18.64
0.61213 0.24387 0.11049 0.02794 0.00284 0.00114 0.00106 0.00019 0.00019 0.00005
1.0 2.5 5.5 21.9 215.5 538.1 575.2 3171.9 3289.8 11178.2
2962 Energy & Fuels, Vol. 21, No. 5, 2007
Sa´ nchez et al.
Table 6. Statistical Parameters for Regression Analysis for Data Set Validation
eq R2 SD slope average AIC average BIC ∆i evidence ratio positive residuals negative residuals absolute differencea a
Weibull extreme
Kumaraswamy
Weibull
β
R (normalized)
25 0.994 2.38 1.004 ( 0.016 18.11 20.18 0 1 164 182 18
17 0.994 2.43 1.009 ( 0.016 18.54 20.61 0.43 1.24 167 179 12
24 0.994 2.54 1.006 ( 0.017 19.34 20.89 1.23 1.85 172 174 2
3 0.993 2.75 1.008 ( 0.009 19.92 21.99 1.81 2.47 176 170 6
1 0.984 4.10 1.010 ( 0.028 36.09 37.13 17.99 8048 195 151 44
Absolute difference between positive and negative residuals.
(eq 17), and Weibull (eq 24) functions require similar computing time (0.248, 0.124, and 0.219, respectively), while the required CPU time for evaluating the β distribution function (eq 3, rank 2) is more than 10 times longer (3.705). This can be explained by the evident relative simplicity of the Weibull extreme, Weibull, and Kumaraswamy distribution functions, which do not include any special function as in the case of the β function (Table 1). The following observations, based on the number of parameters (two, three, or four) of each distribution function, can be made: • Four-parameter distribution functions offer the best fitting capability. Five of them are ranked among the top 10. The Weibull extreme, β, and Kumaraswamy distributions are in the best-ranked group. • Some of the three-parameter distribution functions can fit distillation data with good accuracy: the Weibull and γ distributions have standard deviations of 0.86 and 1.07% and are ranked within the best five. • Two-parameter distribution functions exhibited poor fitting capability. All but one of them are in group 3. • It must be noticed that the γ and normal distribution functions, which are the most popular distribution functions used for fitting distillation data, were ranked 5 and 20, respectively. For validation purposes, fitting capabilities of the four best functions and the worst function (eqs 25, 17, 24, 31, and 1) were determined using other data sets. A total of 30 samples, which are from three whole crude oils and their various boiling range fractions, with a total of 346 points were selected for this task. They cover a wide range of distillation temperatures (from 20 to 540 °C). The validation results are presented in Table 6. Residuals for the Weibull extreme and normalized R functions are shown in Figure 5. It can be seen in Table 6 that the standard
Figure 5. Residual plot for validation data set. (+) Weibull extreme; (O) normalized R.
Table 7. Composition of Products from Hydrocracking at P ) 10 MPa and LHSV ) 0.5 h-1 (El-Kady41) Product
1
2
3
reactor temperature, °C
410
430
450
10.42 6.92 15.26 22.84 28.26 16.30
17.23 19.03 36.03 11.15 13.75 2.81
gases (C2-C5) light naphtha (IBP-80 °C) gasoline (80-150 °C) kerosene (150-250 °C) gasoil (250-380 °C) residue (380-538 °C) IBP 5% 10% 30% 50% 70% 90% 95% FBP
Reported Yields 5.90 3.51 9.07 21.23 25.28 35.01
Estimated Distillation Data of Liquid Product, °C 36.0 36.0 86.4 60.9 125.6 89.7 228.1 174.4 321.9 247.6 401.9 325.0 479.9 428.2 505.7 468.9 538.0 538.0
36.0 46.3 56.5 87.8 122.4 174.9 307.9 394.4 538.0
deviations and residuals using the four best equations are lower than those obtained with the normalized R distribution function (SD of about 2.5 versus 4.1). The correlation coefficients and slopes of the parity plots are closer to unity and the intercepts closer to zero for the four best functions as compared to the worst function. Additionally, the absolute difference between the number of positive and negative residuals of the normalized R function is more than twice that of the other functions, which means that the former is overestimating the experimental values. An inspection of the AIC and BIC values, ∆i, and evidence ratios yielded the same order in the ranking from the validation set as from the testing data set. These validation results corroborate that Weibull extreme, Kumaraswamy, and Weibull are the best distribution functions to fit distillation data. To illustrate one application of fitting distribution functions to experimental distillation data, a data set of hydrocracking products of vacuum gas oil (distillation range, 380-550 °C; molecular weight, 425 g/mol; density at 15 °C, 0.931 g/mL), obtained in a fixed-bed reactor at 10 MPa and 0.5 h-1 liquid hourly space velocity (LHSV), was taken from the literature.41 The reported composition data of various products are detailed in Table 7. The complete distillation curves of the whole hydrocracking products were not reported; however, they can be reproduced from yields and temperature ranges of products by using distribution functions. The IBP of naphtha was assumed (41) El-Kady, F. Y. Indian J. Technol. 1979, 17, 176. (42) McLaughlin, M. P. A compendium of common probability distributions. http://www.causascientia.org/math_stat/Dists/Compendium. pdf (accessed Mar 2007). (43) Johnson, R.; Kotz, S.; Balakrishnan, M. Continuous UniVariate Distributions; John Wiley and Sons: New York, 1994.
Comparison of Probability Distribution Functions
Energy & Fuels, Vol. 21, No. 5, 2007 2963
approaches. It was possible to identify a set of four probability distribution functions, which correlate the data within experimental error. Additionally, the required CPU time and simplicity were taken into account as a final criterion to select the most suitable distribution functions. Weibull extreme, Weibull, and Kumaraswamy probability distribution functions are recommended for fitting distillation data, whose application for this purpose has not been previously reported. Further work is necessary to correlate the features of these best-ranked functions with the nature of distillation curves of petroleum. Nomenclature
Figure 6. Comparison of Weibull extreme distribution function (s) and Hermite interpolation method (- -) for representing experimental distillation data of products from hydrocracking at different temperatures: (b) 410 °C, (9) 430 °C, and (2) 450 °C (data from El-Kady41).
to be that of n-C5 so that a total of six points was available. The procedure previously described was applied and a set of optimal parameters for the Weibull extreme distribution function was determined, and a complete distillation curve was generated from only partial data, which is also reported in Table 7. This procedure and data were successfully applied for kinetic modeling of the hydrocracking in which a complete distillation curve was needed.12 For comparison purposes, the results of the complete distillation curves obtained by using the Weibull extreme probability distribution function were plotted together with those determined by the common interpolation method (piecewise cubic Hermite interpolation), and the results are presented in Figure 6. The Hermite interpolation method was selected because it preserves monotonicity and the shape of the data. It can be clearly observed that the distillation curves obtained with cubic interpolation show “humps”, although passing through all of the experimental points; this feature is not typically observed in distillation curves. On the contrary, the Weibull extreme probability distribution function provides a very good fit to the shape of the distillation curve and experimental points. It is worthy to mention that the Weibull function was previously recommended by Dhulesia to describe distillation curves of feeds and products of the FCC process.5 Conclusion The probability distribution functions in their cumulative forms are very useful in general for fitting distillation data. On the basis of statistical analyses of 25 functions and 1474 distillation data points, it was possible to establish a ranking of fitting capability of the functions according to two approaches: (1) with standard deviation, a correlation coefficient, and a residuals analysis and (2) with the AIC and BIC methodology. Even when SD introduces a bias due to the number of parameters, the rankings obtained are quite similar with both
A, B, C, D ) Distribution parameters AIC ) Akaike information criterion AICi ) AIC for model i in eq 31 AICmin ) AIC for best model in eq 31 BIC ) Bayesian information criterion k ) Number of free parameters n ) Number of observations R ) Number of models considered in the study in eq 32 R2 ) Correlation coefficient RSS ) Residual sum of squares SD ) Standard deviation t ) Independent variable in eqs 1-25 T ) Actual temperature boiling point T1, T2 ) Reference temperatures y ) Independent variable in eqs 1-25, t ) (y - A)/B ycal ) Calculated weight fraction yexp ) Experimental weight fraction Acronyms CDF ) Cumulative distribution function FBP ) Final boiling point IBP ) Initial boiling point MS ) Mass spectroscopy NMR ) Nuclear magnetic resonance PDF ) Probability density function XRD ) X-ray diffraction Greek symbols ∆i ) AIC differences for model i with respect to best model Ι ) Incomplete beta function Γ ) Gamma function Φ ) Normal cumulative distribution function θ ) Dimensionless temperature ωi ) AIC weight for model i ω1/ωj ) Evidence ratio of model j with respect to model 1 Acknowledgment. The authors thank the Mexican Institute of Petroleum for economic support. Discussions with Fraser Forbes on the AIC and BIC rankings are also appreciated. Supporting Information Available: Entire experimental data set used for determining the fitting capability of probability distribution functions. This information is available free of charge via the Internet at http://pubs.acs.org. EF070003Y