Minimizing the Effects of Collinearity in Polynomial Regression

Ind. Eng. Chem. Res. 1997, 36, 4405-4412


Mordechai Shacham*
Department of Chemical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel

Neima Brauner
School of Engineering, Tel-Aviv University, Tel-Aviv 69978, Israel

Data transformation for obtaining the most accurate and statistically valid correlation is discussed. It is shown that the degree of a polynomial used in regression is limited by collinearity among the monomials. The significance of collinearity can best be measured by the truncation to natural error ratio. The truncation error is the error in representing the highest power term by a lower degree polynomial, and the natural error is due to the limited precision of the experimental data. Several transformations for reducing collinearity are introduced. The use of orthogonal polynomials provides an estimation of the truncation to natural error ratio on the basis of the range and precision of the independent variable data. Consequently, the highest degree of polynomial adequate for a particular set of data can be predicted. It is shown that the transformation which yields values of the independent variable in the range [-1,1] is the most effective in reducing collinearity and allows fitting the highest degree polynomial to the data. In an example presented, the use of this transformation enables an increase in the degree of the statistically valid polynomial, thus yielding a much more accurate and well-behaved correlation.

1. Introduction

Mathematical modeling and simulation of physical phenomena require, in addition to an accurate model, precise equations to represent the pertinent physicochemical properties as functions of temperature, pressure, composition, etc. Such equations require fitting some parameters by regression of experimental data, and the accuracy of simulations of physical phenomena critically depends on the accuracy of these correlation equations. Modern regression techniques allow derivation of equations and parameters which can predict values within the experimental error. Collinearity among the original independent variables may prevent reaching this goal.

The problem of collinearity has been addressed by means of variable transformation, ridge regression, principal component regression, shrunk estimates, and partial least squares (for a brief review and a list of references for the various methods, see for example Wold et al., 1984). Belsley (1991), Bradley and Srivastava (1979), and Seber (1977) discuss the problems that can be caused by collinearity in polynomial regression and suggest certain approaches to reduce the undesired effects of collinearity.

Unfortunately, the effects of collinearity are not taken into account in published correlations of various thermophysical properties (see for example Daubert and Danner (1987) or Reid et al. (1977)). As a result, the correlations may contain either an insufficient number of parameters to represent the data accurately or too many parameters. If there are too many parameters, the correlation becomes ill-conditioned, whereby adding or removing experimental points from the data set may drastically change the parameter values. Also, derivatives are not represented correctly, and extrapolation may yield absurd results even for a small range of extrapolation.

* Author to whom correspondence should be addressed. Telephone: 972-7-6461767. Fax: 972-7-6472916. E-mail: [email protected].

In this paper, we limit the discussion to polynomial regression, but the results can be readily extended to other forms of regression equations. The effects of collinearity in polynomial regression are described, and several transformations which reduce collinearity are presented. It is shown that reducing collinearity allows a more accurate representation of the experimental data. The various transformations are compared using theoretical analysis and numerical experimentation. Practical application of the theoretical results is demonstrated by fitting a polynomial equation to data of heat capacity versus temperature for solid HBr. The numerical calculations were carried out using the regression program of the POLYMATH 4.0 package (Shacham and Cutlip, 1996). The MATLAB program (Math Works, 1992) was used for calculations related to matrices.

2. Collinearity in Least Squares Error Regression

Let us assume that there is a set of N data points of a dependent variable yᵢ versus an independent variable xᵢ. An nth order polynomial fitted to the data is of the following form:

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_n x_i^n + \epsilon_i \qquad (1)$$

where β₀, β₁, ..., βₙ are the parameters of the model and εᵢ is the error in yᵢ. The vector of estimated parameters β̂ᵀ = (β̂₀, β̂₁, ..., β̂ₙ) is usually calculated via the least squares error approach, by solving the normal equation:

$$X^T X \hat{\beta} = X^T y \qquad (2)$$

The rows of X are xᵢᵀ = (1, xᵢ, ..., xᵢⁿ), and XᵀX = A is the normal matrix.
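As a concrete illustration of eqs 1 and 2, the parameters of a polynomial can be estimated by forming X and solving the normal equation directly. The following is a minimal sketch, assuming NumPy; the synthetic data and function name are illustrative:

```python
import numpy as np

def fit_polynomial(x, y, n):
    """Estimate (beta_0, ..., beta_n) by solving the normal equation (eq 2)."""
    # Rows of X are (1, x_i, x_i^2, ..., x_i^n), as in eq 1.
    X = np.vander(x, n + 1, increasing=True)
    A = X.T @ X                        # normal matrix A = X^T X
    return np.linalg.solve(A, X.T @ y)

# Synthetic example: y = 1 + 2x + 3x^2 plus small noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 20)
y = 1.0 + 2.0 * x + 3.0 * x**2 + rng.normal(0.0, 0.01, x.size)
print(fit_polynomial(x, y, 2))         # approximately [1, 2, 3]
```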

One of the assumptions of the least squares error approach is that there is no error in the independent variables. However, this is rarely true. The precision of independent variables is limited due to limitations of the measuring and control devices.

Thus, the value of an independent variable can be represented by

$$x_i = \hat{x}_i + \delta_i \qquad (3)$$

where x̂ᵢ is the expected measured value of xᵢ and δᵢ is the error (or uncertainty) in its value. The least squares error approach can be applied in a way that considers the error in both the dependent and independent variables (see for example p 87 in Mandel, 1991), but this will usually have very little effect on the calculated values of β̂. Nevertheless, the error in the independent variable plays an important role in determining the highest degree of the polynomial and the number of parameters that can be fitted to the data.

Collinearity among the different variables can severely limit the accuracy of a regression model. A typical consequence of collinearity is that adding or removing a single data point may cause very large changes in the calculated parameter values. This effect is usually called "ill-conditioning" of the regression equation. A collinearity is said to exist among the columns of X = [x₁, x₂, ..., xₙ] if for a suitably small predetermined η > 0 there exist constants c₁, c₂, ..., cₙ, not all of which are zero, such that (Gunst, 1984)

$$c_0 x_0 + c_1 x_1 + c_2 x_2 + \cdots + c_n x_n = \Delta; \qquad \|\Delta\| < \eta\,\|c\| \qquad (4)$$

In the case of polynomial regression, xⱼ = xʲ. This definition cannot be used directly for diagnosing collinearity because it is not known how small η should be for the harmful effects of collinearity to show. Collinearity has traditionally been expressed by the condition number κ(A) of the normal matrix. The condition number is defined as

$$\kappa(A) = \|A\|\,\|A^{-1}\| \qquad (5)$$

Various matrix norms can be used in eq 5. The one commonly used in the numerical analysis and statistical literature is the maximal eigenvalue (spectral) norm (see for example Belsley, 1991, p 54), in which case κ(A) is the ratio of the largest eigenvalue (λmax) to the smallest eigenvalue (λmin). Stronger collinearity is indicated by a higher value of the condition number, which, in turn, causes amplification of the errors εᵢ and δᵢ in the calculation of the parameter values of the regression equations. Ideally, κ(A) should be close to 1, but in regression using different functions of the same independent variable, κ(A) is usually larger by several orders of magnitude. Thus, errors in the data are amplified considerably. For very large values of the condition number, a small and insignificant change in the data may cause a very large change in the calculated parameter values.

A large condition number also causes inflation of the diagonal elements of the (XᵀX)⁻¹ matrix. Those elements are used in the calculation of the confidence intervals on the parameter values, obtained by the following equation:

$$\hat{\beta}_i - t(\nu,\alpha)\,s\sqrt{a_{ii}} \le \beta_i \le \hat{\beta}_i + t(\nu,\alpha)\,s\sqrt{a_{ii}} \qquad (6)$$

where aᵢᵢ is the corresponding diagonal element of the inverse normal matrix, t(ν, α) is the t-distribution value corresponding to ν degrees of freedom and a desired confidence level α, and s is the standard error of the estimate.
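The interval of eq 6 is straightforward to evaluate once the fit is available. A sketch, assuming NumPy and SciPy (for the t quantile); the names are illustrative:

```python
import numpy as np
from scipy import stats

def confidence_halfwidths(x, y, n, alpha=0.05):
    """Half-widths t(nu, alpha)*s*sqrt(a_ii) of the eq 6 intervals."""
    X = np.vander(x, n + 1, increasing=True)
    beta = np.linalg.solve(X.T @ X, X.T @ y)
    resid = y - X @ beta
    nu = x.size - (n + 1)                    # degrees of freedom
    s2 = resid @ resid / nu                  # squared standard error of the estimate
    a = np.diag(np.linalg.inv(X.T @ X))      # diagonal elements a_ii of (X^T X)^-1
    t = stats.t.ppf(1.0 - alpha / 2.0, nu)   # two-sided t value
    return beta, t * np.sqrt(s2 * a)
```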

Larger confidence intervals indicate that a small change in the data can effect large changes in the calculated parameter values. Of particular significance is the case when the value βᵢ = 0 is included in the interval defined by eq 6. This happens when β̂ᵢ is smaller (in absolute value) than the term t(ν, α)s√aᵢᵢ. In such a case, there is no statistical justification for including the associated term in the correlation equation.

Both the condition number and the confidence intervals have certain drawbacks as collinearity indicators. For the condition number, there is no definite cut-off value that can indicate when collinearity reaches a harmful level. Confidence intervals rely on both independent and dependent variable data, and it is difficult to pinpoint whether too-wide intervals result from collinearity or from imprecision of the data. Therefore, a new collinearity indicator, the "truncation-to-natural error ratio", is introduced herein.

Let us assume that we are considering increasing the degree of the regression polynomial by 1, from degree j - 1 to degree j. The jth power of xᵢ can be represented by a lower degree polynomial:

$$x_i^j = \alpha_0 + \alpha_1 x_i + \alpha_2 x_i^2 + \cdots + \alpha_{j-1} x_i^{j-1} + E_{i,j} \qquad (7)$$

where α₀, α₁, α₂, ..., αⱼ₋₁ are constants and Eᵢ,ⱼ is the truncation error in the representation of xᵢʲ by a linear combination of xᵢ, xᵢ², ..., xᵢʲ⁻¹. When a Taylor series expansion is used to represent xʲ (in terms of lower powers of x) in the vicinity of a point x₀ within the measurement interval, the truncation error is Eᵢ,ⱼ = (xᵢ - x₀)ʲ. The truncation error can be numerically calculated after regressing xʲ in terms of a lower degree polynomial; Eᵢ,ⱼ is then obtained from the residual plot. Comparison of eq 7 with eq 4 shows that |Eᵢ,ⱼ| = |Δᵢ/cⱼ|; thus, a smaller Eᵢ,ⱼ represents a higher level of collinearity.

Let δ(xᵢʲ) represent the natural error in xᵢʲ (resulting from the limited precision of xᵢ). In order for xᵢʲ to contain significant information which is not already included in the (j - 1) degree polynomial, δ(xᵢʲ) < Eᵢ,ⱼ must hold. Such a criterion guarantees that there are a few accurate digits in xᵢʲ which are not represented by a linear combination of lower powers of xᵢ. For N data points, this criterion can be rewritten as

$$\|\delta(x^j)\| < \|E_j\| \qquad (8)$$

where ‖·‖ represents a norm (say, the Euclidean norm). Since |δ(xᵢʲ)| ≤ |dxʲ/dx|·|δᵢ|, a good estimate for δ(xᵢʲ) is jxᵢʲ⁻¹δᵢ. Thus, the criterion for adding a new term xʲ to the existing terms of a polynomial, eq 8, can be rewritten as

$$Er_j = \frac{\|E_j\|}{\|j x^{j-1}\delta\|} > 1 \qquad (9)$$

where Erⱼ is the ratio of the norm of the truncation error to the norm of the natural error. If Erⱼ is significantly greater than 1, the jth power of x can be safely added to the polynomial. On the other hand, when Erⱼ is smaller than 1, there is no independent information in xʲ which is not already represented by a combination of the other polynomial terms. Thus, adding this term will not reduce the variance significantly and will not improve the correlation. Numerical experimentation has shown that the threshold value for Erⱼ is not exactly 1 but in the range between 1 and 10.
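Erⱼ can be computed directly from the data along the lines just described: regress xʲ on a polynomial of degree j - 1 to obtain the truncation error Eⱼ, and estimate the natural error norm as ‖jxʲ⁻¹δ‖. A minimal sketch, assuming NumPy (the range and δ are illustrative):

```python
import numpy as np

def error_ratio(x, j, delta):
    """Truncation-to-natural error ratio Er_j of eq 9 for adding the x^j term."""
    X = np.vander(x, j, increasing=True)          # columns 1, x, ..., x^(j-1)
    target = x**j
    alpha, *_ = np.linalg.lstsq(X, target, rcond=None)
    E = target - X @ alpha                        # truncation error of eq 7
    natural = j * x**(j - 1) * delta              # estimate of delta(x^j)
    return np.linalg.norm(E) / np.linalg.norm(natural)

x = np.linspace(119.0, 182.0, 18)                 # an illustrative data range
print(error_ratio(x, 4, 0.05))                    # Er_4 for the raw variable
```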


Belsley (1991) lists several additional collinearity indicators, such as the variance inflation factor, eigenvalues and eigenvectors of the normal matrix, coefficient of variation, etc. These indicators use essentially the same information that is included in the truncation error, the condition number, or the confidence interval. Therefore, these additional indicators will not be included in the subsequent discussion.

3. Transformations To Reduce Collinearity

There are several transformations that can be used to reduce collinearity. The one which is routinely used is division of the values of xᵢ by xmax, where xmax is the point with the largest absolute value. Thus, vᵢ = xᵢ/xmax, where vᵢ is the normalized xᵢ value. If all the xᵢ are of the same sign (say, xᵢ > 0), the normalized value will vary in the range 0 < vmin ≤ vᵢ ≤ 1. This transformation can considerably change the value of Eᵢ,ⱼ (for example, when xᵢ ≫ 1) and consequently reduce the condition number. But the relative truncation error, defined in eq 9, will not change significantly, since with this transformation both the range (xᵢ - x₀) and xᵢ change in the same proportion.

The transformation wᵢ = (xᵢ - xmin)/(xmax - xmin) yields values in the range 0 ≤ wᵢ ≤ 1. For this transformation, the maximal range is (wᵢ - w₀)max ≈ 0.5 and wmax = 1. Such a transformation has yielded highly accurate correlations for vapor pressure (see for example Wagner, 1973).

The transformation zᵢ = (2xᵢ - xmax - xmin)/(xmax - xmin) yields a variable distributed in the range -1 ≤ zᵢ ≤ 1. Similar transformations are widely used and highly recommended by statisticians (see for example Seber, 1977). In the following sections, the effects of the various transformations on collinearity will be studied.

4. Comparison of the Transformations Using the Condition Number

Assuming that xᵢ is distributed uniformly and N is large enough, the elements of the normal matrix can be evaluated. Forsythe (1957) derived the expression for the w transformation:

$$(X^TX)_{rs} \approx \frac{N}{r+s-1} \qquad (10)$$

where r and s are the row and column indexes, respectively. The elements of the normal matrix, as shown in eq 10, are N times the elements of the Hilbert matrix, which is known to be severely ill-conditioned. For the v transformation, assuming vᵢ is distributed approximately uniformly on [vmin, 1], where 0 ≤ vmin < 1, the elements of the normal matrix are given by

$$(X^TX)_{rs} \approx N\,\frac{1 - v_{\min}^{\,r+s-1}}{(1 - v_{\min})(r+s-1)} \qquad (11)$$

For the z[-1,1] transformation,

$$(X^TX)_{rs} \approx \begin{cases} 0 & \text{if } r+s-1 \text{ is even} \\[4pt] \dfrac{N}{r+s-1} & \text{if } r+s-1 \text{ is odd} \end{cases} \qquad (12)$$

Thus, in this case, the elements of the normal matrix are the same as the Hilbert matrix elements except that every other term is replaced by zero.

Figure 1. Condition number versus polynomial order for various transformations.

Table 1. Condition Numbers for 4th Order Polynomials and Slope of log[κ(A)] versus Polynomial Order

transformation   condition number   slope of log[κ(A)] vs polynomial order
z[-1,1]          3.5829 × 10^2      0.72
w[0,1]           4.7661 × 10^5      1.5
v[0.3,1]         2.2057 × 10^7      2.0
v[0.5,1]         6.3676 × 10^8      2.2
v[0.7,1]         7.6006 × 10^10     2.77
v[0.9,1]         1.666 × 10^11      3.29

Equations 10-12 can be used to calculate the condition numbers of the normal matrices corresponding to polynomials of various degrees using the various transformations. For instance, the condition numbers for 4th order polynomials are shown in Table 1. The condition number is the smallest (358) for the z[-1,1] transformation, it increases by 3 orders of magnitude for the w[0,1] transformation, and it keeps increasing as the range of the v transformation becomes narrower, reaching κ(A) = 1.666 × 10^11 for the v[0.9,1] transformation.

Figure 1 presents the condition number versus polynomial order for the various transformations on a semilogarithmic scale. It can be seen that, for a particular transformation, log[κ(A)] can be represented by a straight line as a function of the order of the polynomial. The slopes of these straight lines for the various transformations are listed in Table 1. The slope is the smallest for the z[-1,1] transformation (0.72) and increases with a narrowing of the range, reaching 3.29 for the v[0.9,1] transformation. Thus, the z transformation offers the most significant reduction of the condition number and is superior to the other transformations in this respect.

The information provided by the condition number alone is insufficient for estimating the maximum polynomial order of statistical validity. It can, however, be used to establish the maximal polynomial order on the basis of numerical considerations. By testing Hilbert matrices of various orders, one can find that for κ(A) ≈ 10^18 most elements of the inverse matrix do not contain even a single accurate digit (using double precision variables containing ~15 significant digits). This limit seems very far off if the data are in the range [-1,1]. But if the range is such that normalization brings it to the [0.9,1] interval, then even a 5th order polynomial can reach the numerical limit.
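These trends are easy to reproduce numerically: form the normal matrix for each transformed variable and evaluate its condition number. A sketch, assuming NumPy and uniformly spaced data (the raw range is illustrative):

```python
import numpy as np

def condition_number(t, order):
    """kappa(A) of the normal matrix for a polynomial of the given order."""
    X = np.vander(t, order + 1, increasing=True)
    return np.linalg.cond(X.T @ X)        # ratio of extreme singular values

x = np.linspace(119.0, 182.0, 200)        # uniformly spaced raw data
v = x / x.max()                                           # v transformation
w = (x - x.min()) / (x.max() - x.min())                   # w[0,1]
z = (2 * x - x.max() - x.min()) / (x.max() - x.min())     # z[-1,1]
for name, t in (("v", v), ("w", w), ("z", z)):
    print(name, condition_number(t, 4))   # smallest for z, largest for v
```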


5. Orthogonal and Legendre Polynomials

One of the approaches often suggested to reduce collinearity in polynomial regression is carrying out the regression with orthogonal polynomials (see for example Seber (1977), pp 215-216).

Table 2. Truncation Errors and Condition Numbers for Correlation with Legendre Polynomials

deg. of polynomial   ||Ej||^a   κ(A)
1                    0.5774     3.0
2                    0.2981     11.25
3                    0.1512     43.75
4                    0.0762     172.27
5                    0.0383     682.17
6                    0.0192     2709.74

^a ||Ej|| = [1/2 ∫₋₁¹ φⱼ² dz]^(1/2).

The basic property of orthogonal polynomials φ(x) is

$$O_r^T O_s = 0 \qquad \text{for all } r, s,\ r \ne s \qquad (13)$$

Because of this property, orthogonal polynomials yield diagonal normal matrices; consequently, the calculated parameter values of the correlation equation are independent of each other. There are several ways to generate orthogonal polynomials (for a description of some of them, see p 220 in Seber (1977)). For most methods to give truly orthogonal polynomials, the data must be evenly distributed. A method which generates orthogonal polynomials independent of the original data distribution, and which can easily be carried out with an interactive regression program (such as POLYMATH), is described in the Appendix.

Comparing eq 7 with eq A-1 in the Appendix shows that Eⱼ = φⱼ. Thus, the truncation error is equal to the respective orthogonal polynomial when the regression is carried out with such polynomials. In this case, a simple relationship exists between the truncation error and the condition number. Since in regression with orthogonal polynomials the normal matrix is diagonal, the singular values of the matrix are the diagonal elements. With a proper normalization of the data, λmax = 1 and λmin = OⱼᵀOⱼ. Thus, the condition number is given by

$$\kappa(A) = (O_j^T O_j)^{-1} \qquad (14)$$

From the definition of the norm of the truncation error, the following relationship can be derived

$$\|E_j\| = \sqrt{[\kappa(A)]^{-1}} \qquad (15)$$

For a large number of data points which are evenly spaced in the [-1,1] interval, Legendre polynomials provide an orthogonal set (see Appendix) and can be integrated over the [-1,1] interval to yield the norm of the truncation error and the corresponding condition number. The calculated values of ||Eⱼ|| and κ(A) are shown in Table 2. The slope of the log[κ(A)] versus polynomial order line is 0.59. Thus, increasing the order of the polynomial increases the condition number approximately by a factor of 4, while the truncation error is reduced by a factor of 0.5. Comparing these results with the results shown in Table 1 and Figure 1 indicates that the use of the z[-1,1] transformation with orthogonalization is indeed less sensitive to the effects of collinearity, as indicated by the condition number.

Legendre polynomials can be used to obtain the orthogonal polynomials in any specified interval [a,b] and, in particular, for the w[0,1] and v[vmin,1] transformations. The respective equations, which yield the truncation error for regression with orthogonal polynomials (up to the 6th order) for the w and v transformations, are listed in the Appendix. In the next section, orthogonal polynomials of the various transformed variables will be used to predict the maximal order of the polynomial which is justifiable on statistical grounds.
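The entries of Table 2 can be checked numerically by scaling each Legendre polynomial to be monic (leading coefficient 1, so that it matches the truncation error of eq 7) and evaluating the integral in the table footnote together with eq 15. A sketch, assuming NumPy's polynomial utilities:

```python
import numpy as np
from numpy.polynomial import Polynomial, legendre

def monic_legendre(j):
    """P_j(z) converted to the power basis and scaled to a leading coefficient of 1."""
    c = np.zeros(j + 1)
    c[j] = 1.0
    p = Polynomial(legendre.leg2poly(c))
    return p / p.coef[-1]

for j in range(1, 7):
    phi = monic_legendre(j)
    F = (phi * phi).integ()                       # antiderivative of phi_j^2
    Ej = np.sqrt(0.5 * (F(1.0) - F(-1.0)))        # ||E_j||, Table 2 footnote
    print(j, round(Ej, 4), round(Ej**-2.0, 2))    # kappa(A) = ||E_j||^-2 by eq 15
```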

6. Estimation of the Maximal Polynomial Orders Justifiable on Statistical Grounds

Orthogonal polynomials can be used for calculating Erⱼ as a function of the precision and the range of the independent variable. The value of ||Eⱼ|| based on Legendre polynomials is shown in Table 2. For j ≥ 1, this norm can be correlated by the expression ||Eⱼ|| = 0.583/2ʲ⁻¹. For a sufficiently large number of data points, the norm of the natural error is given by the following integral:

$$I_z = \delta(z)\left[\frac{1}{2}\int_{-1}^{1}\left(\frac{dz^j}{dz}\right)^2 dz\right]^{1/2} = \delta(z)\,\frac{j}{\sqrt{2j-1}} \qquad (16)$$

Dividing ||Ej|| by Iz yields an expression for Erj when the z transformation is used:

$$(Er_j)_z = \frac{0.583\sqrt{2j-1}}{2^{j-1}\,\delta(z)\,j} \qquad (17)$$

For a statistically justifiable polynomial equation, we require that Erⱼ > 1. The values of Erⱼ based on the Legendre polynomials have been calculated using eq 17, and the maximal order that satisfies this criterion has been entered in Table 3. It can be seen that when the precision of the data of the independent variable is low (δ(z) = 0.01), regression by only up to a 6th order polynomial is justifiable. For high-precision data (δ = 10⁻⁶), polynomials up to the 18th degree can be fitted. Since δ(z) = 2δ(x)/(xmax - xmin), the maximal order of the polynomial depends on both the range and the precision of the data of the independent variable.

Using eqs A-4 and A-5 in the Appendix, the norms of the truncation error ||Eⱼ|| for the w and v transformations can be calculated as follows:

$$\|E_j\|_w = \left[\int_0^1 \phi_j^2(w)\,dw\right]^{1/2} \qquad (18)$$

$$\|E_j\|_v = \left[\frac{1}{1 - v_{\min}}\int_{v_{\min}}^{1} \phi_j^2(v)\,dv\right]^{1/2} \qquad (19)$$

The results can be correlated versus the polynomial order j. The respective expressions are shown in Table 3. Expressions for the norms of the natural errors for the w and v transformations can be derived on the basis of the same arguments that were used for deriving eq 16. Thus, for the w[0,1] transformation,

$$I_w = \delta(w)\,\frac{j}{\sqrt{2j-1}} \qquad (20)$$

and for the v[vmin,1] transformation,

$$I_v = \delta(v)\,j\left[\frac{1 - v_{\min}^{\,2j-1}}{(1 - v_{\min})(2j-1)}\right]^{1/2} \qquad (21)$$

Table 3. Maximal Polynomial Orders Satisfying Erj > 1 on the Basis of Theoretical Analysis

                                         maximum polynomial order, j, for Erj > 1
transformation   expression for ||Ej||   δ = 10^-2 a   δ = 10^-3   δ = 10^-4   δ = 10^-5   δ = 10^-6
z[-1,1]          0.583/2.0^(j-1)         6             9           12          15          18
w[0,1]           0.289/3.88^(j-1)        3             4           6           8           9
v[0.5,1]         0.144/7.86^(j-1)        2             3           4           5           6
v[0.9,1]         0.0288/38.7^(j-1)       1             1           2           2           3

^a Normalized error: δ = δx/(xmax - xmin). The corresponding natural errors in the transformed variables are δ(z) = 2δ, δ(w) = δ, δ(v) = δ(1 - vmin).
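The z[-1,1] row of Table 3 can be reproduced by evaluating eq 17 for increasing j until the criterion fails. In the sketch below (NumPy assumed), δ(z) is taken equal to the tabulated δ, which matches the worked values quoted in the text (e.g., δ(z) = 0.01 gives a maximal order of 6):

```python
import numpy as np

def er_z(j, delta_z):
    """(Er_j)_z of eq 17 for the z[-1,1] transformation."""
    return 0.583 * np.sqrt(2 * j - 1) / (2 ** (j - 1) * delta_z * j)

def max_order(delta_z):
    """Largest polynomial order j for which Er_j > 1."""
    j = 1
    while er_z(j + 1, delta_z) > 1.0:
        j += 1
    return j

for d in (1e-2, 1e-3, 1e-4, 1e-5, 1e-6):
    print(d, max_order(d))     # 6, 9, 12, 15, 18, as in Table 3
```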

Table 4. Heat Capacity Data for Solid HBr (Giauque and Wiebe, 1928) and Various Transformations of the Temperature Data

no.   Cp (cal/(mol·K))   T (K)     v = T/Tmax   w = (T - Tmin)/(Tmax - Tmin)   z = (2T - Tmax - Tmin)/(Tmax - Tmin)
1     10.79              118.99    0.653468     0.0                            -1.0
2     10.80              120.76    0.663189     0.0280507                      -0.943899
3     10.86              122.71    0.673898     0.058954                       -0.882092
4     10.93              125.48    0.689110     0.102853                       -0.794295
5     10.99              127.31    0.699160     0.131854                       -0.736292
6     10.96              130.06    0.714262     0.175436                       -0.649128
7     10.98              132.41    0.727168     0.212678                       -0.574643
8     11.03              135.89    0.746279     0.267829                       -0.464342
9     11.08              139.02    0.763469     0.317433                       -0.365135
10    11.10              140.25    0.770224     0.336926                       -0.326149
11    11.19              145.61    0.799660     0.421870                       -0.156260
12    11.25              153.45    0.842715     0.546117                       0.0922345
13    11.40              158.03    0.867868     0.618700                       0.237401
14    11.61              162.72    0.893624     0.693027                       0.386054
15    11.69              167.67    0.920808     0.771474                       0.542948
16    11.91              172.86    0.949311     0.853724                       0.707448
17    12.07              177.52    0.974903     0.927575                       0.855151
18    12.32              182.09    1.0          1.0                            1.0
δ     -                  0.05      2.746 × 10^-4   7.924 × 10^-4               1.585 × 10^-3

Using the expressions for ||Eⱼ|| obtained by eq 18 or 19 and eq 20 or 21 for the norm of the natural error, the value of Erⱼ for the w or v transformation can be calculated. Table 3 shows the maximal polynomial order for which Erⱼ > 1. The results show the same trend as indicated by the condition number (Table 1 and Figure 1). The z transformation allows fitting the highest order polynomial for data of a particular precision. The narrower the range of the data, the lower the maximal order of the polynomial that can be fitted to the data.

7. Correlation of Heat Capacity Data of Solid HBr: An Example

7.1. Polynomial Regression of the Heat Capacity Data

Heat capacity versus temperature data are usually correlated by polynomials. Daubert and Danner (1987), for example, used a 4th order polynomial to correlate heat capacity (Cp) data versus temperature for solid HBr, as published by Giauque and Wiebe (1928). The suggested validity range of the correlation is 125 K ≤ T ≤ 185.15 K. Solid HBr exhibits several phase transitions which are associated with discontinuities in the heat capacity versus temperature curve. The data considered here are only that part which lies between the highest temperature transition point and the melting point, T = 118.99 to 182.09 K. The reported precision of the temperature measurements is δᵢ = ±0.05 K. The estimated error in the heat capacity data is εᵢ = ±0.3%. Table 4 shows the ranges and the estimated errors in the temperature and the transformed variables of the Cp data.

To demonstrate the problems associated with collinearity, we first fit polynomials of up to 4th order to the normalized values of Cp (C̄pᵢ = Cpᵢ/12.32) versus normalized T (vᵢ = Tᵢ/182.09). Indeed, the 4th order polynomial correlates the data fairly well (Figure 2a). This conclusion is further reinforced by the residual plot (Figure 2b) and the corresponding normal probability plot (Figure 2c). The error may be considered as approximately normally distributed with a maximal error of 0.8%, which is about twice the estimated experimental error.

Figure 2. Fourth-order polynomial representation of the heat capacity data and residual and normal probability plots.

Table 5. Parameters, Confidence Intervals, and Variances of Polynomials of Various Orders Representing Normalized C̄p as a Function of Normalized Temperature

parameter   1st order             2nd order              3rd order             4th order
β̂0          0.660599 ± 0.03263    1.17193 ± 0.1371       -0.382334 ± 0.8381    -6.34305 ± 7.677
β̂1          0.31941 ± 0.04056     -0.951116 ± 0.3392     4.83831 ± 3.11        34.4208 ± 38.01
β̂2          -                     0.774343 ± 0.2064      -6.34061 ± 3.813      -60.98999 ± 70.10
β̂3          -                     -                      2.88502 ± 1.545       47.4234 ± 57.08
β̂4          -                     -                      -                     -13.5127 ± 17.3107
s²          8.009 × 10^-5         1.624 × 10^-5          8.110 × 10^-6         7.166 × 10^-6

Table 6. Collinearity Indicators for Various Transformations

variable   polynomial   κ(A)             Erj
T (K)      3rd order    4.1844 × 10^18   1.682
           4th order    3.15 × 10^23     0.123
v          3rd order    3.1443 × 10^7    1.659
           4th order    1.4096 × 10^10   0.123
w          3rd order    9.538 × 10^3     16.85
           4th order    3.113 × 10^5     3.46
z          3rd order    6.0623 × 10^1    69.09
           4th order    3.7726 × 10^2    28.34

Table 5 shows the parameter values, the confidence intervals, and the variances for up to 4th order polynomials. While the variance decreases continuously from the 1st to the 4th order polynomial, the confidence intervals increase when using higher order polynomials. For the 3rd order polynomial, the value of β̂₀ is no longer significantly different from zero, and for the 4th order polynomial none of the parameters is significantly different from zero. Consequently, only the use of a 2nd order polynomial can be justified on statistical grounds, since for 3rd or 4th order polynomials a small change in the data can introduce large changes in the parameter values. Indeed, removing one data point (T = 153.45 K) from the set yields the following 4th order polynomial parameters: β̂₀ = -3.96594, β̂₁ = 22.694, β̂₂ = -39.4994, β̂₃ = 30.0849, and β̂₄ = -8.31485. These values are completely different from the values shown in Table 5, while the changes effected in the parameter values of the 2nd order polynomial are very small.

Table 6 shows κ(A) and Erⱼ for the various transformations, which indicate the role of collinearity in rendering the 3rd and 4th order polynomials statistically invalid. These collinearity diagnostics can also indicate which of the transformations alleviates the undesired effects of collinearity. It can be seen that when using the non-normalized temperature data, the condition numbers for both 3rd and 4th order polynomials are astronomical. For this reason, most regression programs normalize the data without requiring any user intervention. The corresponding values of Erⱼ are 1.682 for the 3rd order polynomial and less than 1 for the 4th order polynomial, indicating that a 3rd order polynomial represents a borderline case, while in T⁴ there is not even a single digit which bears information independent of that already included in the lower terms of the polynomial. Normalization of the temperature data (dividing by Tmax), which results in the v[0.65,1] transformation, reduces the condition numbers considerably but does not change Erⱼ, indicating that collinearity still prevents the use of 3rd or 4th order polynomials for correlating the Cp data. Using the w[0,1] transformation reduces the condition numbers further, and Er₃ reaches a value of 22.68, indicating that with this transformation collinearity should not prevent the use of a 3rd order polynomial.

Table 7. Parameters, Confidence Intervals, and Variances of Polynomials Representing C̄p(z)

parameter   3rd order               4th order
β̂0          0.914142 ± 0.002413     0.912586 ± 0.003031
β̂1          0.0467734 ± 0.005957    0.0474777 ± 0.00571
β̂2          0.0244619 ± 0.004456    0.0364789 ± 0.00767
β̂3          0.0150068 ± 0.00803     0.0142414 ± 0.00767
β̂4          -                       -0.0121785 ± 0.0156
s²          8.10984 × 10^-6         7.1664 × 10^-6

The transformation that minimizes the effects of collinearity is the z[-1,1] transformation. The condition numbers are the smallest, and both Er₃ and Er₄ are greater than 10, indicating that collinearity will not prevent the use of even a 4th order polynomial. Table 7 shows the parameter values, confidence intervals, and variances for 3rd and 4th order polynomials representing C̄p as a function of the transformed variable z. It can be seen that, for the 3rd order polynomial, all the parameters are significantly different from zero, making this correlation statistically valid. For the 4th order polynomial, only the coefficient associated with the z⁴ term is not significantly different from zero, indicating that the precision of the data (not collinearity) limits the maximum degree of the polynomial to 3. The z[-1,1] transformation thus eliminates, in this particular case, the undesired effects of collinearity. If, for example, the data set is changed by removing a data point (the same as above), the parameters of the 4th order polynomial change very little, except β̂₄, which remains not significantly different from zero.

In order to check the effects of orthogonalization, orthogonal polynomials were generated using eq A-1 for the various transformed variables. Orthogonalization indeed changes the condition numbers: for a 4th order polynomial, κ(A) = 1.61 × 10^8 is obtained for the v transformation, κ(A) = 3.35 × 10^4 for the w transformation, and κ(A) = 131.0 for the z transformation. Comparing these values with those shown in Table 6 reveals that orthogonalization reduces the condition number, but it does not change the truncation error ||Eⱼ||. Since the norm of the natural error does not change either, it can be concluded that orthogonalization has no significant effect on the collinearity indicated by Erⱼ. The fact that the condition number may be reduced without affecting the truncation error indicates that the condition number is not a reliable collinearity diagnostic.

7.2. Comparison with Theoretical Results

Using the methods described in Section 4 for calculating the condition numbers and the methods described in Section 6 to predict ||Eⱼ|| and Erⱼ, the calculated results (of Table 6) can be compared to the corresponding theoretical values. Table 8 shows this comparison. It can be seen that the agreement between the calculated and theoretical values of κ(A), ||Eⱼ||, and Erⱼ is excellent for the w and z transformations, and slightly worse but still satisfactory for the v[0.65,1] transformation.

Table 8. Collinearity Indicators for the Heat Capacity Data in Comparison with Theoretical Values

                         condition number                 truncation error               natural error             Erj
variable    order    calculated       theoretical     calculated    theoretical    calculated  theoretical   calculated  theoretical
v[0.65,1]   3        3.1443 × 10^7    4.788 × 10^7    9.2 × 10^-4   7.826 × 10^-4  0.00055     5.85 × 10^-4  1.66        1.34
v[0.65,1]   4        1.4096 × 10^10   2.03 × 10^10    7.87 × 10^-5  6.828 × 10^-5  0.00064     6.84 × 10^-4  0.123       0.1
w[0,1]      3        0.954 × 10^4     1.55 × 10^4     0.022         0.0189         0.00098     0.00106       22.68       17.8
w[0,1]      4        3.113 × 10^5     4.766 × 10^5    5.44 × 10^-3  4.76 × 10^-3   0.00125     0.0012        4.74        3.97
z[-1,1]     3        60.62            67.6            0.179         0.151          0.0026      0.00213       69.09       71.9
z[-1,1]     4        377.26           358.29          0.087         0.076          0.0031      0.0024        28.34       31.6

It should be emphasized that the theoretical values are approached for a very large (infinite) number of data points which are evenly distributed on the measurement interval. Hence, the calculated values should get even closer to the theoretical values for data sets containing a large number of data points. The excellent agreement between the calculated and theoretical values in this example (and in some additional cases we have tested) indicates that the maximum polynomial orders given in Table 3 can be considered fairly realistic. More numerical experimentation, using data sets of different sizes, ranges, and precisions, may provide more information for fine-tuning the bounds on the maximal polynomial orders.
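The 3rd order column of Table 7 can be reproduced from the data of Table 4. A sketch, assuming NumPy; the computed parameters should come out close to the tabulated ones:

```python
import numpy as np

# Heat capacity data for solid HBr (Giauque and Wiebe, 1928), from Table 4.
T = np.array([118.99, 120.76, 122.71, 125.48, 127.31, 130.06, 132.41, 135.89,
              139.02, 140.25, 145.61, 153.45, 158.03, 162.72, 167.67, 172.86,
              177.52, 182.09])
Cp = np.array([10.79, 10.80, 10.86, 10.93, 10.99, 10.96, 10.98, 11.03, 11.08,
               11.10, 11.19, 11.25, 11.40, 11.61, 11.69, 11.91, 12.07, 12.32])

# z[-1,1] transformation of T and normalization of Cp, as in section 7.1.
z = (2 * T - T.max() - T.min()) / (T.max() - T.min())
Cp_bar = Cp / Cp.max()

# 3rd order fit via the normal equation (eq 2).
X = np.vander(z, 4, increasing=True)
beta = np.linalg.solve(X.T @ X, X.T @ Cp_bar)
print(beta)                        # compare with the 3rd order column of Table 7
print(np.linalg.cond(X.T @ X))     # small condition number for the z variable
```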


8. Summary and Conclusions


We have used both theoretical analysis and numerical experimentation to investigate the various effects of collinearity in polynomial regression. A new criterion for measuring collinearity, the truncation-to-natural error norms ratio Er, has been introduced. This criterion depends only on the independent variable data (range, precision, etc.) and allows setting an upper limit on the polynomial degree on the basis of the restriction imposed by collinearity. The diagnostic threshold offered is Er > 1.

The use of orthogonal polynomials and of the w[0,1] and z[-1,1] transformations for reducing the effects of collinearity has been described and demonstrated in detail. The theoretical analysis and the numerical experimentation have shown that the z[-1,1] transformation minimizes the effects of collinearity and has the following advantages over the other transformations:

1. It yields the smallest condition number and the largest truncation-to-natural error norms ratio for the same data and the same polynomial order. Consequently, it allows the fitting of the highest degree polynomial to the data.

2. It yields the smallest off-diagonal elements in the normal matrix. Because of this, the interdependency among the calculated parameter values is the smallest, and statistical indicators, such as confidence intervals, can clearly show which of the monomials are needed and which should be omitted.

The analysis has also shown that using orthogonal polynomials reduces the condition number but does not increase the truncation-to-natural error ratio. Thus, the use of such polynomials as a means for reducing collinearity is not recommended.

The example presented, correlating heat capacity data of solid HBr by a polynomial equation, verifies the results of the theoretical analysis. In this example, the use of the z transformation enables increasing the degree of the statistically valid polynomial correlation from 2 to 3, thus yielding a much more accurate and well-behaved correlation equation.

Appendix: Orthogonal Polynomials

The equations used to generate the orthogonal polynomials are the following:

$$\begin{aligned}
\phi_0(x) &= 1 \\
\phi_1(x) &= x - \bar{x} \\
&\ \,\vdots \\
\phi_j(x) &= x^j - [\alpha_{0,j} + \alpha_{1,j}\phi_1(x) + \alpha_{2,j}\phi_2(x) + \cdots + \alpha_{j-1,j}\phi_{j-1}(x)]
\end{aligned} \qquad (A\text{-}1)$$

where x̄ is the mean of xᵢ and α₀,ⱼ, α₁,ⱼ, ..., αⱼ₋₁,ⱼ are the parameters of the regression equation when correlating xʲ with φ₁(x), φ₂(x), ..., φⱼ₋₁(x). A proof that the method described by eq A-1 generates orthogonal directions is given in many textbooks (for example, Box and Draper, 1987).
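A sketch of the eq A-1 procedure, assuming NumPy: each power xʲ is regressed on the previously generated polynomials, and the residual is kept as φⱼ (the names are illustrative):

```python
import numpy as np

def orthogonal_polynomials(x, n):
    """Columns phi_0(x), ..., phi_n(x) generated by the eq A-1 recursion."""
    Phi = np.empty((x.size, n + 1))
    Phi[:, 0] = 1.0
    for j in range(1, n + 1):
        target = x**j
        # alpha_{0,j}, ..., alpha_{j-1,j}: regression of x^j on phi_0, ..., phi_{j-1}
        alpha, *_ = np.linalg.lstsq(Phi[:, :j], target, rcond=None)
        Phi[:, j] = target - Phi[:, :j] @ alpha   # residual = phi_j
    return Phi

x = np.linspace(-1.0, 1.0, 1000)
Phi = orthogonal_polynomials(x, 4)
# The normal matrix is diagonal (off-diagonal elements vanish to rounding),
# so the fitted parameter values are mutually independent.
print(np.round(Phi.T @ Phi, 8))
```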

Original or transformed variables can be used to generate orthogonal polynomials. For large numbers of data points which are evenly spaced in the [-1,1] interval, Legendre polynomials Pⱼ(z) provide an orthogonal set. These are defined by the following recursive equations:

$$P_0(z) = 1, \qquad P_1(z) = z, \qquad (j+1)P_{j+1}(z) = (2j+1)\,z\,P_j(z) - j\,P_{j-1}(z) \qquad (A\text{-}2)$$

The coefficients for representing zʲ in terms of Pₖ(z), zʲ = Σₖ₌₀ʲ αₖPₖ(z), are tabulated, for example, by Abramowitz and Stegun (1972). The first six Legendre polynomials, scaled to yield the respective truncation error, φⱼ = αⱼPⱼ(z), are given by the following:

For the z[-1,1] transformation,

$$\begin{aligned}
\phi_0(z) &= 1 \\
\phi_1(z) &= z \\
\phi_2(z) &= \tfrac{1}{3}(-1 + 3z^2) \\
\phi_3(z) &= \tfrac{1}{5}(-3z + 5z^3) \\
\phi_4(z) &= \tfrac{1}{35}(3 - 30z^2 + 35z^4) \\
\phi_5(z) &= \tfrac{1}{63}(15z - 70z^3 + 63z^5) \\
\phi_6(z) &= \tfrac{1}{231}(-5 + 105z^2 - 315z^4 + 231z^6)
\end{aligned} \qquad (A\text{-}3)$$

For the w[0,1] transformation,

$$\begin{aligned}
\phi_0(w) &= 1 \\
\phi_1(w) &= \tfrac{1}{2}(2w - 1) \\
\phi_2(w) &= \tfrac{1}{6}(6w^2 - 6w + 1) \\
\phi_3(w) &= \tfrac{1}{40}(2w - 1)[-3 + 5(2w - 1)^2] \\
\phi_4(w) &= \tfrac{1}{560}[3 - 30(2w - 1)^2 + 35(2w - 1)^4] \\
\phi_5(w) &= \tfrac{1}{2016}[15(2w - 1) - 70(2w - 1)^3 + 63(2w - 1)^5] \\
\phi_6(w) &= \tfrac{1}{14784}[-5 + 105(2w - 1)^2 - 315(2w - 1)^4 + 231(2w - 1)^6]
\end{aligned} \qquad (A\text{-}4)$$

For the v[vmin,1] transformation,

$$\begin{aligned}
\phi_0(v) &= 1 \\
\phi_1(v) &= v - \frac{v_{\min} + 1}{2} \\
\phi_2(v) &= \frac{(1 - v_{\min})^2}{12}(-1 + 3u^2); \qquad u = \frac{2v - v_{\min} - 1}{1 - v_{\min}} \\
\phi_3(v) &= \frac{(1 - v_{\min})^3}{40}(-3 + 5u^2)\,u \\
\phi_4(v) &= \frac{(1 - v_{\min})^4}{560}(3 - 30u^2 + 35u^4) \\
\phi_5(v) &= \frac{(1 - v_{\min})^5}{2016}(15u - 70u^3 + 63u^5) \\
\phi_6(v) &= \frac{(1 - v_{\min})^6}{14784}(-5 + 105u^2 - 315u^4 + 231u^6)
\end{aligned} \qquad (A\text{-}5)$$

Literature Cited

Abramowitz, M.; Stegun, I. A. Handbook of Mathematical Functions; Dover: New York, 1972.

Belsley, D. A. Conditioning Diagnostics: Collinearity and Weak Data in Regression; John Wiley: New York, 1991.

Box, G. E. P.; Draper, N. R. Empirical Model Building and Response Surfaces; Wiley: New York, 1987.

Bradley, R. A.; Srivastava, S. S. Correlation in Polynomial Regression. Am. Stat. 1979, 33, 11-14.

Daubert, T. E.; Danner, R. P. Physical and Thermodynamic Properties of Pure Chemicals: Data Compilation; Hemisphere Publishing Co.: New York, 1989.

Forsythe, G. E. Generation and Use of Orthogonal Polynomials for Data-Fitting with Digital Computer. J. Soc. Ind. Appl. Math. 1957, 5, 74-87.

Giauque, W. F.; Wiebe, R. The Heat Capacity of Hydrogen Bromide from 15°K to its Boiling Point and its Heat of Vaporization. J. Am. Chem. Soc. 1928, 50, 2193-2203.

Gunst, R. F. Towards a Balanced Assessment of Collinearity Diagnostics. Am. Stat. 1984, 38 (2), 79-82.

Mandel, J. Evaluation and Control of Measurements; Quality and Reliability; Marcel Dekker, Inc.: New York, 1991.

Math Works, Inc. The Student Edition of MATLAB; Prentice Hall: Englewood Cliffs, NJ, 1992.

Reid, R. C.; Prausnitz, J. M.; Sherwood, T. K. Properties of Gases and Liquids, 3rd ed.; McGraw-Hill: New York, 1977.

Seber, G. A. F. Linear Regression Analysis; Wiley: New York, 1977.

Shacham, M.; Cutlip, M. B. POLYMATH 4.0 User's Manual; CACHE Corporation: Austin, TX, 1996.

Wagner, W. New Vapor Pressure Measurements for Argon and Nitrogen and a New Method for Establishing Rational Vapor Pressure Equations. Cryogenics 1973, 13, 470.

Wold, S.; Ruhe, A.; Wold, H.; Dunn, W. J., III. The Collinearity Problem in Linear Regression. The Partial Least Squares (PLS) Approach to Generalized Inverses. SIAM J. Sci. Stat. Comput. 1984, 5 (3), 735-743.

Received for review March 21, 1997
Revised manuscript received July 2, 1997
Accepted July 3, 1997

