Computer Methods and Programs in Biomedicine 71 (2003) 141-147, www.elsevier.com/locate/cmpb

Principal component regression analysis with SPSS

R.X. Liu a,*, J. Kuang b, Q. Gong a, X.L. Hou c

a Medical College of Jinan University, Guangzhou 510632, People's Republic of China
b Guangdong Provincial People's Hospital, Guangzhou 510080, People's Republic of China
c Jinan University Library, Guangzhou 510632, People's Republic of China

Received 30 March 2001; received in revised form 10 April 2002; accepted 11 April 2002

Abstract

The paper introduces the indices used in multicollinearity diagnosis, the basic principle of principal component regression and the method of determining the 'best' equation. An example is used to describe how to do principal component regression analysis with SPSS 10.0, covering the complete calculation process of the principal component regression and the operation of the linear regression, factor analysis, descriptives, compute variable and bivariate correlations procedures in SPSS 10.0. Principal component regression analysis can be used to overcome the disturbance of multicollinearity. Doing the principal component regression with SPSS makes the analysis simpler, faster and more accurate. © 2002 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Multicollinearity diagnosis; Principal component regression analysis; SPSS

1. Introduction

In multivariate analysis, the least-squares method is generally adopted to fit a multiple linear regression model, but the least-squares estimates are sometimes far from perfect. One important cause is that the column vectors of the matrix X are close to being linearly dependent. An approximate linear relationship among independent variables is called multicollinearity. Multicollinearity among the independent variables tends to make the sign and magnitude of the estimated regression coefficients inconsistent with the expected ones.

* Corresponding author. Tel.: +86-20-8522-0259; fax: +86-20-8522-1343. E-mail address: [email protected] (R.X. Liu).


The index most often used to judge collinearity is the simple correlation coefficient: when the simple correlation coefficient between two independent variables is large, collinearity is suspected. Apart from the simple correlation coefficient, SPSS provides collinearity statistics ([1], p. 221): tolerance and the variance inflation factor (VIF). Tolerance = 1 - R_i^2, where R_i^2 is the squared multiple correlation of the ith variable with the other independent variables. When its value is small (close to 0), the variable is almost a linear combination of the other independent variables. VIF is the reciprocal of tolerance, so variables with low tolerance have large VIF, and low tolerance together with large VIF suggests collinearity.


Eigenvalues, condition indices and variance proportions are also indices for collinearity diagnosis ([1], pp. 229-230). Eigenvalues indicate how many distinct dimensions there are among the independent variables; when several eigenvalues are close to 0, the variables are highly intercorrelated and the matrix is said to be ill-conditioned. Condition indices are the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity. Variance proportions are the proportions of the variance of each estimate accounted for by the principal component associated with each eigenvalue: when a component associated with a high condition index contributes substantially to the variance of two or more variables, those independent variables are the highly intercorrelated ones. Principal component regression combines linear regression with principal component analysis ([2], pp. 327-332). Principal component analysis gathers highly correlated independent variables into principal components, and all principal components are independent of each other, so in effect it transforms a set of correlated variables into a set of uncorrelated principal components. We then build regression equations on the uncorrelated principal components and select the 'best' equation according to the principle of maximum adjusted R² and minimum standard error of estimate. Finally, the 'best' equation is transformed back into a general linear regression equation. The present paper demonstrates how the multicollinearity problem is solved by doing the principal component regression with SPSS 10.0 [3].
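These diagnostics are also straightforward to compute outside SPSS. The following is a minimal Python sketch, assuming only numpy; note that it derives the condition indices from the correlation matrix of the predictors, whereas SPSS uses the scaled cross-products matrix including the constant term, so the numerical values will differ somewhat from those reported in Table 3 below.

```python
# Minimal collinearity diagnostics: tolerance, VIF and condition indices.
# The diagonal of R^-1 equals 1/(1 - R_i^2), i.e. the VIF of each variable.
import numpy as np

def collinearity_diagnostics(X):
    """X: (n, p) array of independent variables, one column per variable."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)   # standardize columns
    R = np.corrcoef(Z, rowvar=False)                   # correlation matrix
    vif = np.diag(np.linalg.inv(R))                    # 1 / tolerance_i
    tolerance = 1.0 / vif
    eigvals = np.linalg.eigvalsh(R)[::-1]              # descending eigenvalues
    cond_index = np.sqrt(eigvals[0] / eigvals)         # sqrt(max eig / eig)
    return tolerance, vif, cond_index
```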

2. Basic principle and formulas

(1) Run a stepwise regression of the dependent variable Y on all the independent variables X to obtain the p independent variables with statistical significance (P < 0.05) and to reveal whether these p independent variables show multicollinearity.

(2) Run a principal component analysis on the p independent variables to transform the set of correlated variables into a set of uncorrelated principal components and to show how much information different sets of principal components carry.

(3) Compute the standardized dependent variable, the p standardized independent variables and the values of the p principal components according to Eqs. (1)-(3), in preparation for setting up the p standardized principal component regression equations:

$$Y' = \frac{Y - \bar{Y}}{S_Y} \qquad (1)$$

$$X'_i = \frac{X_i - \bar{X}_i}{S_{X_i}} \qquad (i = 1, \ldots, p) \qquad (2)$$

$$C_i = a_{i1} X'_1 + a_{i2} X'_2 + \cdots + a_{ip} X'_p \qquad (i = 1, \ldots, p) \qquad (3)$$

where Y' stands for the standardized dependent variable, Y the dependent variable, S_Y the standard deviation of the dependent variable, \bar{Y} the mean of the dependent variable, X'_i the ith standardized independent variable, X_i the ith independent variable, \bar{X}_i the mean of the ith independent variable, S_{X_i} the standard deviation of the ith independent variable, C_i the ith principal component and a_{ij} the coefficients of the principal component matrix (the matrix relating the C_i to the X'_i).

(4) Build the standardized principal component regression equation with the first principal component, then add the remaining principal components one by one to get the m standardized principal component regression equations, as shown in Eq. (4); meanwhile check whether all principal components are independent of each other, then determine the 'best' standardized principal component regression equation among the equations of form (4) on the basis of the maximum adjusted R² and minimum standard error of estimate, by comparing the adjusted R² and the standard error of estimate of each standardized principal component regression equation.

$$\hat{y}'_j = \sum_i B'_i C_i \qquad (j = 1, \ldots, m \le p;\; i = 1, \ldots, K \le p) \qquad (4)$$

where \hat{y}'_j is the estimate of the jth standardized principal component regression equation and B'_i the ith standardized partial regression coefficient of the standardized principal component regression equation.


(5) Applying Eq. (3) to the 'best' standardized principal component regression equation and collecting terms yields the standardized linear regression equation, as shown in Eq. (5):

$$\hat{y}' = \sum_i b'_i X'_i \qquad (i = 1, \ldots, K \le p) \qquad (5)$$

where \hat{y}' is the estimate of the standardized linear regression equation and b'_i the ith standardized partial regression coefficient of the standardized linear regression equation.

(6) Compute the partial regression coefficients and the constant, as shown in Eqs. (6) and (7), and finally transform the standardized linear regression equation into the general linear regression equation, as shown in Eq. (8):

$$b_i = b'_i \left( L_{yy} / L_{x_i x_i} \right)^{1/2} \qquad (i = 1, \ldots, K \le p) \qquad (6)$$

$$b_0 = \bar{Y} - \sum_i b_i \bar{X}_i \qquad (i = 1, \ldots, K \le p) \qquad (7)$$

$$\hat{y} = b_0 + \sum_i b_i X_i \qquad (i = 1, \ldots, K \le p) \qquad (8)$$

where b_i is the ith partial regression coefficient of the general linear regression equation, L_yy the sum of squared deviations about the mean of the dependent variable Y, L_{x_i x_i} the sum of squared deviations about the mean of the ith independent variable X_i, and b_0 the constant of the general linear regression equation.
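The whole of Section 2 can be compressed into a short script. The sketch below is a Python rendering of Eqs. (1)-(8) under two stated assumptions, not the paper's SPSS procedure: the components are extracted from the correlation matrix, and, following the convention used in the example, the component scores are formed directly from the loadings a_ij (eigenvectors scaled by the square roots of their eigenvalues).

```python
# Principal component regression per Eqs. (1)-(8): standardize, extract
# components, regress on the leading components, then unstandardize.
import numpy as np

def pcr(y, X):
    """y: (n,) dependent variable; X: (n, p) independent variables."""
    n, p = X.shape
    y_std = (y - y.mean()) / y.std(ddof=1)            # Eq. (1)
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)  # Eq. (2)
    eigval, eigvec = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
    order = np.argsort(eigval)[::-1]                  # components by size
    eigval, eigvec = eigval[order], eigvec[:, order]
    A = eigvec * np.sqrt(eigval)                      # loadings a_ij
    C = Z @ A                                         # Eq. (3): scores
    best = None
    for k in range(1, p + 1):                         # Eq. (4): nested fits
        B, res, *_ = np.linalg.lstsq(C[:, :k], y_std, rcond=None)
        rss = float(res[0]) if res.size else float(np.sum((y_std - C[:, :k] @ B) ** 2))
        adj_r2 = 1.0 - rss / (n - k - 1)              # y_std has unit variance
        if best is None or adj_r2 > best[0]:
            best = (adj_r2, k, B)
    adj_r2, k, B = best
    b_std = A[:, :k] @ B                              # Eq. (5): the b'_i
    b = b_std * y.std(ddof=1) / X.std(axis=0, ddof=1) # Eq. (6)
    b0 = y.mean() - b @ X.mean(axis=0)                # Eq. (7)
    return b0, b, adj_r2                              # Eq. (8): yhat = b0 + X b
```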

3. Example

For the years 1951-1998 (the data for 1969 and 1986 are not indexed), the yearly mortality (1/100 000) from traffic accidents in mainland China, the number (10 000 vehicles) of motor vehicles, the volume (10 000 tons) of freight transport, the volume (10 000 persons) of passenger transport, the mileage (10 000 km) driven on formal highways and the mileage (10 000 km) driven on informal highways are expressed as the dependent variable Y and the independent variables X1, X2, X3, X4 and X5, respectively. Table 1 displays the mean and standard deviation (S.D.) of all variables.

(1) Select the significant independent variables (P < 0.05) with SPSS backward elimination and diagnose the multicollinearity of each independent variable ([3], pp. 299-308). In the SPSS linear regression dialog box, enter 'y' (the dependent variable) into the dependent box and 'x1, x2, x3, x4 and x5' (all independent variables) into the independent box, and select Backward in the method selection control. In the SPSS linear regression: statistics dialog box, click on Descriptives, Covariance matrix and Collinearity diagnostics, leaving the other items at the SPSS defaults. Running the SPSS linear regression procedure yields the results of Tables 2 and 3.

Table 2 shows that the partial regression coefficients b1, b3 and b4 of three independent variables (X1, X3 and X4) are highly significant (P < 0.0005) and that b1 is equal to -7.52 x 10^-4, which would indicate a negative correlation between the mortality from traffic accidents and the number of motor vehicles. This result is contrary to common sense, so we check whether there are multicollinearities among the independent variables. Table 2 also shows that tolerance_X1 and tolerance_X3 are small (0.040 and 0.022) and that VIF_X1 and VIF_X3 are large (25.233 and 46.402). Table 3 shows that R_X1,X3 is large (0.950), the 4th eigenvalue is close to 0 (0.007352), its condition index is more than 15 (21.362) and the variance proportions of the independent variables X1 and X3 on that dimension are large (0.88 and 0.99). These facts clearly indicate that there is a collinearity between X1 and X3.
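For readers working outside SPSS, step (1) can be reproduced with an ordinary backward elimination loop. The sketch below assumes statsmodels and a hypothetical DataFrame named traffic with columns y, x1, ..., x5; note that SPSS's Backward method removes variables on P >= 0.10 by default, so the threshold is an explicit parameter here.

```python
# Backward elimination: repeatedly drop the least significant predictor.
import pandas as pd
import statsmodels.api as sm

def backward_select(df, dep, indep, alpha=0.05):
    kept = list(indep)
    while kept:
        fit = sm.OLS(df[dep], sm.add_constant(df[kept])).fit()
        pvals = fit.pvalues.drop("const")      # predictor P-values only
        worst = pvals.idxmax()
        if pvals[worst] < alpha:               # every predictor significant
            return kept, fit
        kept.remove(worst)                     # drop the worst, refit
    return kept, None

# e.g. kept, fit = backward_select(traffic, "y", ["x1", "x2", "x3", "x4", "x5"])
```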

Table 1
The mean and S.D. of all variables

          Y         X1          X2           X3           X4        X5
Mean   2.3504   730.1911   324111.65    313836.61    54.1220   18.5461
S.D.   2.0428  1196.0913   347739.798   372520.677   31.5176    5.6636


Table 2
Partial regression coefficients and collinearity statistics of the linear regression equation

Model       b_i              t        P       Tolerance   VIF
Constant   -8.94 x 10^-2   -0.773    0.444
X1         -7.52 x 10^-4   -3.972    0.000    0.040      25.233
X3          6.132 x 10^-6    7.439    0.000    0.022      46.402
X4          1.967 x 10^-2    4.891    0.000    0.126       7.906

Table 3
Indices of collinearity diagnosis and simple correlation coefficient R_Xi,Xj of the linear regression equation

                                             % of variance
Dimension   Eigenvalue   Condition index   Constant   X1     X3     X4
1           3.355          1.000           0.01       0.00   0.00   0.00
2           0.572          2.422           0.14       0.02   0.00   0.00
3           0.06533        7.166           0.56       0.10   0.01   0.24
4           0.007352      21.362           0.29       0.88   0.99   0.76

R_X1,X3 = 0.950

Table 4
The eigenvalue, % of variance and coefficients for each principal component (the last three columns are the coefficients a_ij of the standardized independent variables)

Principal    Eigenvalue   % of       % of cumulative    X1'        X3'        X4'
component                 variance   variance
1            2.746        91.523      91.523            0.954      0.993      0.922
2            0.241         8.033      99.556           -0.292     -0.0787     0.387
3            0.01333       0.444     100.000            0.06481   -0.0906     0.03044

(2) Use the SPSS factor analysis procedure to obtain the principal component matrix of the independent variables X1, X3 and X4 and the cumulative variance proportions of different sets of principal components ([3], pp. 323-334). In the SPSS factor analysis dialog box, enter 'x1, x3 and x4' (the independent variables X1, X3 and X4) into the variable box. In the factor analysis extraction dialog box, click on Number of factors and type '3' into the box, leaving the other items at the SPSS defaults. All results are shown in Table 4 after running the factor analysis.

Table 4 shows that the cumulative variance proportion of one principal component (the 1st principal component C1) is 91.523%, that of two principal components (C1 and C2) is 99.556% and that of three principal components (C1, C2 and C3) is 100.000%. From Table 4, the coefficients (a_ij) relating the three standardized independent variables to the three principal components give the expressions of the three principal components: C1 = 0.954X1' + 0.993X3' + 0.922X4', C2 = -0.292X1' - 0.0787X3' + 0.387X4' and C3 = 0.06481X1' - 0.0906X3' + 0.03044X4'.
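What the factor analysis step computes can also be written down directly: the component matrix holds the loadings a_ij, i.e. the eigenvectors of the correlation matrix of the selected variables scaled by the square roots of their eigenvalues. A sketch follows; since the sign of each eigenvector is arbitrary, individual columns may come out with flipped signs relative to Table 4.

```python
# Component matrix of loadings a_ij from the correlation matrix.
import numpy as np

def component_matrix(X):
    """X: (n, p) array of the selected independent variables."""
    eigval, eigvec = np.linalg.eigh(np.corrcoef(X, rowvar=False))
    order = np.argsort(eigval)[::-1]           # decreasing eigenvalue
    eigval, eigvec = eigval[order], eigvec[:, order]
    loadings = eigvec * np.sqrt(eigval)        # rows = variables, cols = components
    pct_variance = 100.0 * eigval / eigval.sum()
    return eigval, pct_variance, loadings
```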


(3) Obtain the standardized dependent variable Y' and the standardized independent variables X1', X3' and X4' by using the SPSS descriptives procedure ([3], pp. 219-222), and the value of each principal component C_i, according to the expression of each principal component, by using the SPSS compute variable procedure ([3], pp. 89-92). In the SPSS descriptives dialog box, enter 'y, x1, x3 and x4' (the dependent variable Y and the independent variables X1, X3 and X4) into the variable[s]: box, click on Save standardized values as variables and click on the OK button to run the SPSS descriptives procedure, which creates the standardized dependent variable zy and the standardized independent variables zx1, zx3 and zx4 in the current working data file. In the SPSS compute variable dialog box, type 'c1' into the target box as the variable name of the first principal component C1, type '0.954*zx1 + 0.993*zx3 + 0.922*zx4' into the numeric expression box and click on the OK button to run the SPSS compute variable procedure, which creates the new variable c1 and its values in the current working data file. To compute the second and third principal components C2 and C3, type 'c2' and 'c3' into the target box and type '-0.292*zx1 - 0.0787*zx3 + 0.387*zx4' and '0.06481*zx1 - 0.0906*zx3 + 0.03044*zx4' into the numeric expression box, respectively. Running the SPSS compute variable procedure for each generates the new variables c2 and c3 and their values in the current working data file.
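Outside SPSS, the descriptives and compute variable steps amount to adding z-scores and three linear combinations to the data set. A sketch, again using the hypothetical traffic DataFrame:

```python
# z-scores plus the three component scores of Table 4.
import pandas as pd

def add_scores(traffic: pd.DataFrame) -> pd.DataFrame:
    df = traffic.copy()
    for col in ["y", "x1", "x3", "x4"]:        # 'Save standardized values'
        df["z" + col] = (df[col] - df[col].mean()) / df[col].std(ddof=1)
    df["c1"] = 0.954 * df["zx1"] + 0.993 * df["zx3"] + 0.922 * df["zx4"]
    df["c2"] = -0.292 * df["zx1"] - 0.0787 * df["zx3"] + 0.387 * df["zx4"]
    df["c3"] = 0.06481 * df["zx1"] - 0.0906 * df["zx3"] + 0.03044 * df["zx4"]
    return df
```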


(4) Use the SPSS linear regression procedure to do the principal component regression analysis: build each standardized principal component regression equation, check whether all principal components are independent of each other and determine the 'best' standardized principal component regression equation ([3], pp. 299-308). In the SPSS linear regression dialog box, enter 'zy' into the dependent box and 'c1' into the independent box. In the SPSS linear regression: statistics dialog box, click on Covariance matrix and Collinearity diagnostics, leaving the other items at the SPSS defaults. This generates the 1st standardized principal component regression equation ŷ'1 = B1'C1. Following the same steps, fit the equations ŷ'2 = B1'C1 + B2'C2 and ŷ'3 = B1'C1 + B2'C2 + B3'C3; the only difference in their operation is that 'c1 and c2' are entered into the independent box for the former and 'c1, c2 and c3' for the latter. After running the SPSS linear regression procedure, the results are shown in Tables 5-7, respectively.

From Table 5, all the standardized partial regression coefficients B_i' of the principal components C_i in the three models (equations) are highly significant (P < 0.0005), giving three standardized principal component regression equations: ŷ'1 = 0.971C1, ŷ'2 = 0.970C1 + 0.148C2 and ŷ'3 = 0.969C1 + 0.148C2 - 0.121C3. Table 5 also shows that all the simple correlation coefficients R_Ci,Cj of the principal components C_i are close to 0 and that their tolerances and VIFs are equal to 1. Table 6 shows that their eigenvalues and condition indices are close to 1. These results confirm that all principal components are independent of each other.
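The three nested fits of step (4) are a few lines with statsmodels (assumed available; df is the DataFrame from the previous sketch). No constant is added because zy and the c_i are centred, so the intercept is zero by construction.

```python
# Fit zy on c1, then c1+c2, then c1+c2+c3, and collect the two criteria.
import statsmodels.api as sm

def compare_component_models(df):
    results = {}
    for k in (1, 2, 3):
        comps = ["c%d" % i for i in range(1, k + 1)]
        fit = sm.OLS(df["zy"], df[comps]).fit()
        see = fit.mse_resid ** 0.5             # standard error of estimate
        results[k] = (fit.rsquared_adj, see, fit.params)
    return results  # choose k with maximum adjusted R^2 / minimum SEE
```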

Table 5
Standardized partial regression coefficients, collinearity statistics and correlation coefficients R_Ci,Cj of the three standardized principal component regression equations

Model        B_i'      t         P       Tolerance   VIF
1   C1       0.971    26.935    0.000    1.000       1.000
2   C1       0.970    33.907    0.000    1.000       1.000
    C2       0.148     5.189    0.000    1.000       1.000
3   C1       0.969    43.925    0.000    1.000       1.000
    C2       0.148     6.726    0.000    1.000       1.000
    C3      -0.121    -5.496    0.000    1.000       1.000

Model 2: R_C1,C2 = -0.009. Model 3: R_C1,C2 = -0.009, R_C1,C3 = 0.002, R_C2,C3 = 0.000.


Table 6
Indices of collinearity diagnosis of the three standardized principal component regression equations

                                                     % of variance
Model   Dimension   Eigenvalue   Condition index    C1     C2     C3
1       1           1.000        1.000              0.00
        2           1.000        1.000              1.00
2       1           1.009        1.000              0.50   0.50
        2           1.000        1.005              0.00   0.00
        3           0.991        1.009              0.50   0.50
3       1           1.009        1.000              0.50   0.47   0.03
        2           1.000        1.005              0.00   0.07   0.93
        3           1.000        1.005              0.00   0.00   0.00
        4           0.990        1.010              0.50   0.47   0.04

R² is a measure of goodness of fit of a linear model and tends to overestimate the population parameter ([1], pp. 197-198, 208). R² ranges from 0 to 1; the closer to 1, the better the fit. Because R² is affected by the number of independent variables in the model and by the sample size, the adjusted R², which is designed to compensate for the optimistic bias of R², is usually used when comparing the goodness of fit of different linear models. The standard error of estimate is the square root of the residual mean square and measures the spread of the residuals about the fitted line ([1], p. 198), so it is also a measure of goodness of fit; the closer to 0, the better the fit. ŷ'3 = 0.969C1 + 0.148C2 - 0.121C3 is determined to be the 'best' equation, as Table 7 shows that the adjusted R² (0.978) and the standard error of estimate (0.1480) of the third standardized principal component regression equation are, respectively, the largest and the smallest of the three equations, while its F value (670.541) is also highly significant (P < 0.0005).
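For reference, the two selection criteria have simple closed forms (standard definitions, not quotations from [1]); with n cases and k regressors in the model:

$$\mathrm{adj}\ R^2 = 1 - \left(1 - R^2\right)\frac{n-1}{n-k-1}, \qquad s_e = \sqrt{\frac{\sum_{t=1}^{n}\left(y_t - \hat{y}_t\right)^2}{n - k - 1}}$$

Since adj R² = 1 - s_e²(n-1)/L_yy for a fixed data set, the equation with the smallest standard error of estimate is necessarily also the one with the largest adjusted R², so the two criteria always agree.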

(5) Use the SPSS bivariate correlations procedure to compute L_yy and L_{x_i x_i} ([3], pp. 285-290). In the SPSS bivariate correlations dialog box, enter 'y, x1, x3 and x4' (the dependent variable Y and the independent variables X1, X3 and X4) into the variables box. In the bivariate correlations: options dialog box, click on Cross-product deviations and covariances. Running the bivariate correlations procedure gives L_yy = 187.788, L_x1x1 = 64378550, L_x3x3 = 6.245 x 10^12 and L_x4x4 = 44701.175.

(6) Transform the 'best' standardized principal component regression equation into the standardized linear regression equation and then into the general linear regression equation. From Table 4, C1 = 0.954X1' + 0.993X3' + 0.922X4', C2 = -0.292X1' - 0.0787X3' + 0.387X4' and C3 = 0.06481X1' - 0.0906X3' + 0.03044X4'. Substituting these into the 'best' standardized principal component regression equation gives ŷ' = 0.969C1 + 0.148C2 - 0.121C3 = 0.969(0.954X1' + 0.993X3' + 0.922X4') + 0.148(-0.292X1' - 0.0787X3' + 0.387X4') - 0.121(0.06481X1' - 0.0906X3' + 0.03044X4'); collecting terms then yields the standardized linear regression equation.
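The corrected sums of squares that SPSS reports as cross-product deviations are simply sums of squared deviations from the mean, so step (5) is one line outside SPSS (df as in the earlier sketches):

```python
# L_yy and L_xixi: corrected (about-the-mean) sums of squares.
def corrected_ss(series):
    d = series - series.mean()
    return float((d * d).sum())

# e.g. L_yy = corrected_ss(df["y"]); L_x1x1 = corrected_ss(df["x1"])
```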

Table 7
The expression, adjusted R², standard error of estimate, F value and P value of each equation

Standardized principal component regression equation   Adjusted R²   Standard error of estimate   F         P
ŷ'1 = 0.971C1                                          0.942         0.2418                       725.490   <0.0005
ŷ'2 = 0.970C1 + 0.148C2                                0.963         0.1918                       589.972   <0.0005
ŷ'3 = 0.969C1 + 0.148C2 - 0.121C3                      0.978         0.1480                       670.541   <0.0005


The standardized linear regression equation is ŷ' = 0.8734X1' + 0.9615X3' + 0.9470X4'. Calculate the general partial regression coefficients b_i from b1' = 0.8734, b3' = 0.9615 and b4' = 0.9470 in the light of Eq. (6): b1 = b1'(L_yy/L_x1x1)^1/2 = 0.8734(187.788/64378550)^1/2 = 0.00149, b3 = b3'(L_yy/L_x3x3)^1/2 = 0.9615(187.788/6.245 x 10^12)^1/2 = 0.0000053 and b4 = b4'(L_yy/L_x4x4)^1/2 = 0.9470(187.788/44701.175)^1/2 = 0.0648, and the constant b0 in accordance with Eq. (7): b0 = Ȳ - Σ b_i X̄_i = 2.3504 - (0.00149 x 730.191 + 0.0000053 x 313836.61 + 0.0648 x 54.1220) = -3.908. Finally, the general linear regression equation is obtained: ŷ = -3.908 + 0.00149X1 + 0.0000053X3 + 0.0648X4.
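The back-transformation of step (6) is mechanical and can be scripted. The sketch below implements Eqs. (6) and (7) generically (all names are hypothetical); because it recomputes every b_i from b_i' and the sums of squares in full precision, its output may differ slightly from the hand-rounded figures quoted above.

```python
# Eqs. (6)-(7): standardized coefficients back to the raw scale.
import numpy as np

def unstandardize(b_std, L_yy, L_xx, y_bar, x_bar):
    """b_std: the b'_i; L_yy, L_xx: corrected sums of squares;
    y_bar, x_bar: means. Returns (b0, b) of the general equation."""
    b = np.asarray(b_std) * np.sqrt(L_yy / np.asarray(L_xx))  # Eq. (6)
    b0 = y_bar - b @ np.asarray(x_bar)                        # Eq. (7)
    return b0, b
```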

4. Discussion

Principal component regression analysis not only overcomes the disturbance of collinearity and exposes the real state of affairs (e.g. that b1 = -7.52 x 10^-4 is corrected to b1 = 0.00149 by the principal component regression indicates a positive correlation between the mortality from traffic accidents and the number of motor vehicles, which accords with the facts), but also loses none of the original information: Table 4 shows that the cumulative variance proportion of the three principal components reaches 100%, i.e. the 'best' principal component regression equation ŷ'3 = 0.969C1 + 0.148C2 - 0.121C3 uses all of the original information. The B1', B2' and B3' in the 'best' principal component regression equation are highly significant (P < 0.0005). This indirectly proves that b1', b3' and b4' in the standardized linear regression equation are also highly significant, since each principal component includes the information of the standardized independent variables X1', X3' and X4'.


By the same reasoning, it is inferred that b1, b3 and b4 in the general linear equation are highly significant. Hence, we can compare the relative influence of the independent variables through the standardized partial regression coefficients b_i', and also make predictions with the general linear regression equation ŷ = -3.908 + 0.00149X1 + 0.0000053X3 + 0.0648X4.

We should use the standardized independent variables X_i' when computing the value of each principal component C_i, whose numeric expression is C_i = a_i1X1' + a_i2X2' + ... + a_ipXp', and not the raw independent variables X_i. If the value of a principal component C_i is computed from the raw independent variables X_i, the result is complete correlation among the principal components C_i (R_Ci,Cj = 1 or -1, i ≠ j).

In multiple linear regression analysis, when results differ from the facts, multicollinearity among the independent variables is usually suspected; the above method can then be used for the analysis. Principal component regression analysis with SPSS is an effective method: not only can it diagnose the collinearity of each independent variable, it also solves the collinearity problem. At the same time, most of the computation is carried out by the computer, which greatly reduces the complicated manual work and makes the statistical analysis simpler, faster and more accurate.

References

[1] SPSS Inc., SPSS Base 10.0 Applications Guide, SPSS Inc., USA, 1999.
[2] N.R. Draper, H. Smith, Applied Regression Analysis, 2nd ed., John Wiley & Sons, New York, 1981.
[3] SPSS Inc., SPSS Base 10.0 User's Guide, SPSS Inc., USA, 1999.