
Computer Methods and Programs in Biomedicine 71 (2003) 141-147

www.elsevier.com/locate/cmpb

doi:10.1016/S0169-2607(02)00058-5

Principal component regression analysis with SPSS

R.X. Liu a,*, J. Kuang b, Q. Gong a, X.L. Hou c

a Medical College of Jinan University, Guangzhou 510632, People's Republic of China
b Guangdong Provincial People's Hospital, Guangzhou 510080, People's Republic of China
c Jinan University Library, Guangzhou 510632, People's Republic of China

Received 30 March 2001; received in revised form 10 April 2002; accepted 11 April 2002

Abstract

The paper introduces the indices used for multicollinearity diagnosis, the basic principle of principal component regression and the method for determining the 'best' equation. An example is used to describe how to do principal component regression analysis with SPSS 10.0, including the whole calculation process of the principal component regression and the operation of the linear regression, factor analysis, descriptives, compute variable and bivariate correlations procedures in SPSS 10.0. Principal component regression analysis can be used to overcome the disturbance of multicollinearity, and with SPSS the analysis is simplified, speeded up and made accurate.
© 2002 Elsevier Science Ireland Ltd. All rights reserved.

Keywords: Multicollinearity diagnosis; Principal component regression analysis; SPSS

* Corresponding author. Tel.: +86-20-8522-0259; fax: +86-20-8522-1343. E-mail address: trxliu@263.net (R.X. Liu).

1. Introduction

In multivariate analysis, the least-squares method is generally adopted in fitting a multiple linear regression model, but the least-squares estimates are sometimes far from perfect. One important cause is that the column vectors of the matrix X are close to linear dependence. An approximate linear relationship among independent variables is called multicollinearity. When multicollinearity exists among the independent variables, the sign and value of an actual regression coefficient tend to be inconsistent with the expected ones.

The most often used index for judging collinearity is the simple correlation coefficient: when the simple correlation coefficient between two independent variables is large, collinearity is suspected. Apart from the simple correlation coefficient, SPSS provides two collinearity statistics ([1], p. 221): tolerance and the variance inflation factor (VIF). Tolerance = 1 − Ri², where Ri² is the squared multiple correlation of the ith variable with the other independent variables. When its value is small (close to 0), the variable is almost a linear combination of the other independent variables. VIF is the reciprocal of tolerance, so variables with low tolerance have large VIF; low tolerance together with a large VIF suggests that a variable is involved in collinearity.

Eigenvalue, condition index and variance proportion are also indices for collinearity diagnosis ([1], pp. 229-230). Eigenvalues provide an indication of how many distinct dimensions there are among the independent variables; when several eigenvalues are close to 0, the variables are highly intercorrelated and the matrix is said to be ill-conditioned. Condition indices are the square roots of the ratios of the largest eigenvalue to each successive eigenvalue. A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity. Variance proportions are the proportions of the variance of each estimate accounted for by the principal component associated with each eigenvalue. A component associated with a high condition index that contributes substantially to the variance of two or more variables marks those independent variables as the ones that are highly intercorrelated.

Principal component regression is a method combining linear regression with principal component analysis ([2], pp. 327-332). Principal component analysis gathers highly correlated independent variables into principal components, and all principal components are independent of each other; in effect it transforms a set of correlated variables into a set of uncorrelated principal components. We then build regression equations on the uncorrelated principal components and select the 'best' equation according to the principle of maximum adjusted R² and minimum standard error of estimate. At last, the 'best' equation is transformed back into a general linear regression equation. The present paper demonstrates how the multicollinearity problem is solved by doing principal component regression with SPSS 10.0 [3].

2. Basic principle and formulas

(1) Run a stepwise regression with the dependent variable Y and all independent variables X to obtain the p independent variables with statistical significance (P < 0.05) and to reveal whether these p independent variables are multicollinear.

(2) Run a principal component analysis on the p independent variables to transform the set of correlated variables into a set of uncorrelated principal components and to indicate the quantity of information carried by different sets of principal components.

(3) Compute the standardized dependent variable, the p standardized independent variables and the values of the p principal components according to Eqs. (1)-(3), in preparation for setting up the p standardized principal component regression equations.

Y′ = (Y − Ȳ)/SY                                        (1)
Xi′ = (Xi − X̄i)/SXi   (i = 1, ..., p)                  (2)
Ci = ai1X1′ + ai2X2′ + ... + aipXp′   (i = 1, ..., p)   (3)

where Y′ stands for the standardized dependent variable, Y the dependent variable, SY the standard deviation of the dependent variable, Ȳ the mean of the dependent variable, Xi′ the ith standardized independent variable, Xi the ith independent variable, X̄i the mean of the ith independent variable, SXi the standard deviation of the ith independent variable, Ci the ith principal component, and aij the coefficients of the principal component matrix (the matrix relating the Ci to the Xi′).

(4) Build the standardized principal component regression equation with the first principal component, then add the remaining principal components one by one to obtain the m standardized principal component regression equations shown in Eq. (4); meanwhile check whether all principal components are independent of each other, and then determine the 'best' standardized principal component regression equation among the equations of Eq. (4) on the basis of maximum adjusted R² and minimum standard error of estimate, by comparing the adjusted R² and standard error of estimate of each equation.

ŷj′ = Σ Bi′Ci   (j = 1, ..., m ≤ p; i = 1, ..., K ≤ p)   (4)

where ŷj′ is the estimate of the jth standardized principal component regression equation, and Bi′ the ith standardized partial regression coefficient of the standardized principal component regression equation.

(5) Applying Eq. (3) to the 'best' standardized principal component regression equation and sorting the result out yields the standardized linear regression equation, as shown in Eq. (5).

ŷ′ = Σ bi′Xi′   (i = 1, ..., K ≤ p)   (5)

where ŷ′ is the estimate of the standardized linear regression equation, and bi′ the ith standardized partial regression coefficient of the standardized linear regression equation.

(6) Compute the partial regression coefficients and the constant, as shown in Eqs. (6) and (7), and at last transform the standardized linear regression equation into the general linear regression equation, as shown in Eq. (8).

bi = bi′(Lyy/Lxixi)^(1/2)   (i = 1, ..., K ≤ p)   (6)
b0 = Ȳ − Σ biX̄i   (i = 1, ..., K ≤ p)             (7)
ŷ = b0 + Σ biXi   (i = 1, ..., K ≤ p)              (8)

where bi is the ith partial regression coefficient of the general linear regression equation, Lyy the sum of squared deviations of the dependent variable Y, Lxixi the sum of squared deviations of the ith independent variable Xi, and b0 the constant of the general linear regression equation.
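Eq. (6) is the usual back-transformation from standardized to raw coefficients. As a one-line check (our addition, not part of the original derivation), write the standard deviations in terms of the sums of squared deviations, so that the (n − 1) factors cancel:

    b_i = b_i' \frac{S_Y}{S_{X_i}}
        = b_i' \sqrt{\frac{L_{yy}/(n-1)}{L_{x_i x_i}/(n-1)}}
        = b_i' \left( \frac{L_{yy}}{L_{x_i x_i}} \right)^{1/2}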
3. Example

Between the years of 1951 and 1998 (the data of 1969 and 1986 are not indexed), the yearly mortality (1/100000) from traffic accidents in the mainland of China, the quantity (10000 vehicles) of motor vehicles, the quantity (10000 tons) of freight transport, the quantity (10000 persons) of passenger transport, the mileage (10000 km) run by motor vehicles on formal highways, and the mileage (10000 km) run by motor vehicles on informal highways are expressed, respectively, as the dependent variable Y and the independent variables X1, X2, X3, X4 and X5. Table 1 displays the mean and standard deviation (S.D.) of all variables.

(1) Select the significant independent variables (P < 0.05) with the SPSS backward method and diagnose the multicollinearity of each independent variable ([3], pp. 299-308).

In the SPSS linear regression dialog box, enter 'y' (the dependent variable) into the dependent box and 'x1, x2, x3, x4 and x5' (all independent variables) into the independent box, and select backward in the method selection control. In the SPSS linear regression: statistics dialog box, click on Descriptives, Covariance matrix and Collinearity diagnostics, while the other items are left at the SPSS defaults. After running the SPSS linear regression procedure, the results of Tables 2 and 3 are obtained.
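For readers who prefer syntax to the dialog boxes, the steps above correspond roughly to the following SPSS command; this is a sketch of what the Paste button would produce, not taken from the original paper, and untested against SPSS 10.0.

    * Step (1): backward selection with collinearity diagnostics
      (assumes the working file contains variables y and x1 to x5).
    REGRESSION
      /DESCRIPTIVES MEAN STDDEV CORR
      /STATISTICS COEFF OUTS R ANOVA COLLIN TOL BCOV
      /DEPENDENT y
      /METHOD=BACKWARD x1 x2 x3 x4 x5 .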

Table 1
The mean and S.D. of all variables

        Y        X1         X2          X3          X4       X5
Mean    2.3504   730.1911   324111.65   313836.61   54.1220  18.5461
S.D.    2.0428   1196.0913  347739.798  372520.677  31.5176  5.6636

Table 2
Partial regression coefficients and collinearity statistics of the linear regression equation

Model      bi           t       P      Tolerance  VIF
Constant   −8.94×10⁻²   −0.773  0.444
X1         −7.52×10⁻⁴   −3.972  0.000  0.040      25.233
X3         6.132×10⁻⁶   7.439   0.000  0.022      46.402
X4         1.967×10⁻²   4.891   0.000  0.126      7.906

Table 3
Indices of collinearity diagnosis and simple correlation coefficient RXiXj of the linear regression equation

                                        % of variance
Dimension  Eigenvalue  Condition index  Constant  X1    X3    X4
1          3.355       1.000            0.01      0.00  0.00  0.00
2          0.572       2.422            0.14      0.02  0.00  0.00
3          0.06533     7.166            0.56      0.10  0.01  0.24
4          0.007352    21.362           0.29      0.88  0.99  0.76

Note: the simple correlation coefficient between X1 and X3 is RX1,X3 = 0.950.

Table 4
The eigenvalue, % of variance and coefficients for each principal component

Principal   Eigenvalue  % of      % of cumulative  Standardized independent variable
component               variance  variance         X1′      X3′      X4′
1           2.746       91.523    91.523           0.954    0.993    0.922
2           0.241       8.033     99.556           −0.292   −0.0787  0.387
3           0.01333     0.444     100.000          0.06481  −0.0906  0.03044

Table 2 displays that the partial regression coefficients b1, b3 and b4 of three independent variables (X1, X3 and X4) are highly significant (P < 0.0005) and that b1 is equal to −7.52×10⁻⁴, which would indicate a negative correlation between the mortality from traffic accidents and the quantity of motor vehicles. This result is contrary to common sense, so we check whether there are multicollinearities among the independent variables. Table 2 also displays that toleranceX1 and toleranceX3 are small (0.04 and 0.022) and that VIFX1 and VIFX3 are large (25.233 and 46.402). Table 3 shows that RX1,X3 is large (0.950), the 4th eigenvalue is close to 0 (0.007352), its condition index is more than 15 (21.362), and the variance proportions of the independent variables X1 and X3 on that dimension are large (0.88 and 0.99). These facts clearly indicate that there is collinearity between X1 and X3.

(2) Use the SPSS factor analysis procedure to obtain the principal component matrix of the independent variables X1, X3 and X4 and the cumulative variance proportions of different principal components ([3], pp. 323-334).

In the SPSS factor analysis dialog box, enter 'x1, x3 and x4' (the independent variables X1, X3 and X4) into the variable box. In the factor analysis extraction dialog box, click on Number of factors and type '3' into the box, whereas the other items are left at the SPSS defaults. All results are shown in Table 4 after running the factor analysis.

Table 4 displays that the cumulative variance proportion of one principal component (the 1st principal component C1) is 91.523%, that of two principal components (C1 and C2) is 99.556% and that of three principal components (C1, C2 and C3) is 100.000%. From Table 4, the coefficients (aij) relating the three standardized independent variables to the three principal components are obtained, giving the expressions of the three principal components: C1 = 0.954X1′ + 0.993X3′ + 0.922X4′, C2 = −0.292X1′ − 0.0787X3′ + 0.387X4′ and C3 = 0.06481X1′ − 0.0906X3′ + 0.03044X4′.
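As before, a hedged syntax equivalent of these dialog steps (our sketch, untested against SPSS 10.0):

    * Step (2): principal component extraction on x1, x3 and x4,
      keeping all three components and applying no rotation.
    FACTOR
      /VARIABLES x1 x3 x4
      /PRINT INITIAL EXTRACTION
      /CRITERIA FACTORS(3)
      /EXTRACTION PC
      /ROTATION NOROTATE .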

(3) Obtain the standardized dependent variable Y′ and the standardized independent variables X1′, X3′ and X4′ by using the SPSS descriptives procedure ([3], pp. 219-222), and the value of each principal component Ci, according to the expression of each principal component, by using the SPSS compute variable procedure ([3], pp. 89-92).

In the SPSS descriptives dialog box, enter 'y, x1, x3 and x4' (the dependent variable Y and the independent variables X1, X3 and X4) into the variable[s] box, click on Save standardized values as variables and click on the OK button to run the SPSS descriptives procedure, which creates the standardized dependent variable zy and the standardized independent variables zx1, zx3 and zx4 in the current working data file.

In the SPSS compute variable dialog box, type 'c1' into the target box as the variable name of the first principal component C1, type '0.954*zx1 + 0.993*zx3 + 0.922*zx4' into the numeric expression box, and click on the OK button to run the SPSS compute variable procedure, which creates a new variable c1 and its values in the current working data file. When the second and third principal components C2 and C3 are computed, type 'c2' and 'c3' into the target box, and type '−0.292*zx1 − 0.0787*zx3 + 0.387*zx4' and '0.06481*zx1 − 0.0906*zx3 + 0.03044*zx4' into the numeric expression box, respectively. After running the SPSS compute variable procedure for each, the new variables c2 and c3 and their values are generated in the current working data file.
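In syntax form, the two procedures reduce to a DESCRIPTIVES command with /SAVE followed by three COMPUTE commands; a sketch, assuming the z-score names zy, zx1, zx3 and zx4 that SPSS creates automatically.

    * Step (3): save z-scores, then build the component scores
      from the Table 4 coefficients.
    DESCRIPTIVES VARIABLES=y x1 x3 x4 /SAVE .
    COMPUTE c1 = 0.954*zx1 + 0.993*zx3 + 0.922*zx4 .
    COMPUTE c2 = -0.292*zx1 - 0.0787*zx3 + 0.387*zx4 .
    COMPUTE c3 = 0.06481*zx1 - 0.0906*zx3 + 0.03044*zx4 .
    EXECUTE .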
(4) Use the SPSS linear regression procedure to do the principal component regression analysis: build each standardized principal component regression equation, check whether all principal components are independent of each other, and determine the 'best' standardized principal component regression equation ([3], pp. 299-308).

In the SPSS linear regression dialog box, enter 'zy' and 'c1' into the dependent and the independent boxes, respectively. In the SPSS linear regression: statistics dialog box, click on Covariance matrix and Collinearity diagnostics, while the other items are left at the SPSS defaults. This generates the 1st standardized principal component regression equation: ŷ1′ = B1′C1. Following the same steps, fit the equations ŷ2′ = B1′C1 + B2′C2 and ŷ3′ = B1′C1 + B2′C2 + B3′C3; the only difference in their operation is that 'c1 and c2' are entered into the independent box for the former, and 'c1, c2 and c3' for the latter. After running the SPSS linear regression procedure, all results are shown in Tables 5-7, respectively.

In Table 5, all standardized partial regression coefficients Bi′ of all principal components Ci in all models (equations) are highly significant (P < 0.0005), which gives three standardized principal component regression equations: ŷ1′ = 0.971C1, ŷ2′ = 0.970C1 + 0.148C2 and ŷ3′ = 0.969C1 + 0.148C2 − 0.121C3. Table 5 also displays that all simple correlation coefficients RCiCj of the principal components Ci are close to 0 and that their tolerances and VIFs are equal to 1. Table 6 shows that their eigenvalues and condition indices are close to 1. These results confirm that all principal components are independent of each other.
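The three dialog runs can also be expressed as one hierarchical REGRESSION command with successive ENTER blocks, which should reproduce Models 1-3 of Tables 5-7; this is our sketch, untested, and note that the paper itself fits the three equations in separate runs.

    * Step (4): principal component regressions, entered block by block
      (Model 1: c1; Model 2: c1 c2; Model 3: c1 c2 c3).
    REGRESSION
      /STATISTICS COEFF OUTS R ANOVA COLLIN TOL BCOV
      /DEPENDENT zy
      /METHOD=ENTER c1
      /METHOD=ENTER c2
      /METHOD=ENTER c3 .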

Table 5
Standardized partial regression coefficients, collinearity statistics and correlation coefficients RCiCj of the three standardized principal component regression equations

Model       Bi′      t        P      Tolerance  VIF
1   C1      0.971    26.935   0.000  1.000      1.000

2   C1      0.970    33.907   0.000  1.000      1.000
    C2      0.148    5.189    0.000  1.000      1.000
    RC1,C2 = −0.009

3   C1      0.969    43.925   0.000  1.000      1.000
    C2      0.148    6.726    0.000  1.000      1.000
    C3      −0.121   −5.496   0.000  1.000      1.000
    RC1,C2 = −0.009; RC1,C3 = 0.002; RC2,C3 = 0.000

Table 6
Indices of collinearity diagnosis of the three standardized principal component regression equations

                                               % of variance
Model  Dimension  Eigenvalue  Condition index  C1    C2    C3
1      1          1.000       1.000            0.00
       2          1.000       1.000            1.00

2      1          1.009       1.000            0.50  0.50
       2          1.000       1.005            0.00  0.00
       3          0.991       1.009            0.50  0.50

3      1          1.009       1.000            0.50  0.47  0.03
       2          1.000       1.005            0.00  0.07  0.93
       3          1.000       1.005            0.00  0.00  0.00
       4          0.990       1.010            0.50  0.47  0.04

R² is a measure of the goodness of fit of a linear model and tends to overestimate the population parameter ([1], pp. 197-198, 208). R² ranges from 0 to 1; the closer to 1, the better the goodness of fit of the linear model. As R² is affected by the number of independent variables in the model and by the sample size, we usually use the adjusted R² when comparing the goodness of fit of different linear models; adjusted R² is designed to compensate for the optimistic bias of R². The standard error of estimate is the square root of the residual mean square and measures the spread of the residuals about the fitted line ([1], p. 198), so it is also a measure of goodness of fit; the closer to 0, the better the fit. ŷ3′ = 0.969C1 + 0.148C2 − 0.121C3 is determined to be the 'best' equation, as Table 7 shows that the adjusted R² (0.978) and standard error of estimate (0.1480) of the third standardized principal component regression equation are, respectively, the largest and the smallest of the three equations, and its F value, 670.541, is also highly significant (P < 0.0005).

(5) Use the SPSS bivariate correlations procedure to compute Lyy and Lxixi ([3], pp. 285-290).

In the SPSS bivariate correlations dialog box, enter 'y, x1, x3 and x4' (the dependent variable Y and the independent variables X1, X3 and X4) into the variables box. In the bivariate correlations: options dialog box, click on Cross-product deviations and covariances. Running the bivariate correlations procedure gives Lyy = 187.788, Lx1x1 = 64378550, Lx3x3 = 6.245×10¹² and Lx4x4 = 44701.175.
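In syntax form (a sketch, untested), the cross-product deviations come from the XPROD statistic of the CORRELATIONS command:

    * Step (5): sums of squares and cross-products, giving Lyy and Lxixi.
    CORRELATIONS
      /VARIABLES=y x1 x3 x4
      /STATISTICS XPROD .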

Table 7
The expression, adjusted R², standard error of estimate, F value and P value of each equation

Standardized principal component regression equation  Adjusted R²  Standard error of estimate  F        P
ŷ1′ = 0.971C1                                         0.942        0.2418                      725.490  <0.0005
ŷ2′ = 0.970C1 + 0.148C2                               0.963        0.1918                      589.972  <0.0005
ŷ3′ = 0.969C1 + 0.148C2 − 0.121C3                     0.978        0.1480                      670.541  <0.0005

(6) Transform the 'best' standardized principal component regression equation into the standardized linear regression equation, and then into the general linear regression equation.

From Table 4, take C1 = 0.954X1′ + 0.993X3′ + 0.922X4′, C2 = −0.292X1′ − 0.0787X3′ + 0.387X4′ and C3 = 0.06481X1′ − 0.0906X3′ + 0.03044X4′, and apply them to the 'best' standardized principal component regression equation: ŷ′ = 0.969C1 + 0.148C2 − 0.121C3 = 0.969(0.954X1′ + 0.993X3′ + 0.922X4′) + 0.148(−0.292X1′ − 0.0787X3′ + 0.387X4′) − 0.121(0.06481X1′ − 0.0906X3′ + 0.03044X4′). After sorting this out, the standardized linear regression equation is obtained: ŷ′ = 0.8734X1′ + 0.9615X3′ + 0.9470X4′.

Calculate the general partial regression coefficients bi from b1′ = 0.8734, b3′ = 0.9615 and b4′ = 0.9470 in the light of Eq. (6): b1 = b1′(Lyy/Lx1x1)^(1/2) = 0.8734(187.788/64378550)^(1/2) = 0.00149, b3 = b3′(Lyy/Lx3x3)^(1/2) = 0.9615(187.788/6.245×10¹²)^(1/2) = 0.0000053, and b4 = b4′(Lyy/Lx4x4)^(1/2) = 0.9470(187.788/44701.175)^(1/2) = 0.0648; and the constant b0 in accordance with Eq. (7): b0 = Ȳ − Σ biX̄i = 2.3504 − (0.00149×730.1911 + 0.0000053×313836.61 + 0.0648×54.1220) = −3.908. Finally, the general linear regression equation is obtained: ŷ = −3.908 + 0.00149X1 + 0.0000053X3 + 0.0648X4.

4. Discussion

Not only can principal component regression analysis overcome the disturbance of collinearity and expose the real face of the facts (e.g. b1 = −7.52×10⁻⁴ is corrected to b1 = 0.00149 through principal component regression analysis, indicating a positive correlation between the mortality from traffic accidents and the quantity of motor vehicles, which is in accordance with the facts), but the original information is not lost either: Table 4 shows that the cumulative variance proportion of the three principal components reaches 100%, namely the 'best' principal component regression equation ŷ3′ = 0.969C1 + 0.148C2 − 0.121C3 uses all of the original information.

B1′, B2′ and B3′ in the 'best' principal component regression equation are highly significant (P < 0.0005). This indirectly proves that b1′, b3′ and b4′ in the standardized linear regression equation are also highly significant, since each principal component includes the information of the standardized independent variables X1′, X3′ and X4′. It is inferred, by the same principle, that b1, b3 and b4 in the general linear equation are highly significant. Hence, we can do a factor analysis with the standardized partial regression coefficients bi′, and also make predictions using the general linear regression equation: ŷ = −3.908 + 0.00149X1 + 0.0000053X3 + 0.0648X4.
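Such a prediction can be obtained directly in SPSS with a COMPUTE command; a sketch, where the variable name yhat is ours, not the paper's:

    * Predicted mortality from the final general linear regression equation.
    COMPUTE yhat = -3.908 + 0.00149*x1 + 0.0000053*x3 + 0.0648*x4 .
    EXECUTE .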
We should use the standardized independent variables Xi′ when computing the value of each principal component Ci, since the numeric expression is Ci = ai1X1′ + ai2X2′ + ... + aipXp′, and should not use the raw independent variables Xi. If the value of a principal component Ci is computed with the raw independent variables Xi, complete correlation among the principal components results (RCiCj = 1 or −1, i ≠ j).

In multiple linear regression analysis, when there is a phenomenon in which the results differ from the facts, multicollinearity among the independent variables is usually suspected, and the above method can then be used for the analysis. Principal component regression analysis with SPSS is an effective method: not only can it diagnose the collinearity of each independent variable, it also solves the collinearity problem. At the same time, the majority of the computation procedures are completed with the help of the computer, which greatly reduces the complicated manual work; a simplified, speeded-up and accurate statistical analysis is achieved at last.
References

[1] SPSS Inc., SPSS Base 10.0 Applications Guide, SPSS Inc., USA, 1999.
[2] N.R. Draper, H. Smith, Applied Regression Analysis, 2nd ed., John Wiley & Sons, New York, 1981.
[3] SPSS Inc., SPSS Base 10.0 User's Guide, SPSS Inc., USA, 1999.
