SPSS Basic Guidance 3
Crosstab & chi-square analysis AND Bivariate data analysis: Correlation and regression
Content:
1. Crosstab & chi-square analysis
2. Bivariate data analysis: Correlation and regression
3. Extra
Extra 1: Chi-square value
Extra 2: Expected count
Disclaimer: This is a student's work. If there is any mistake, please contact the author.
1. Crosstab and chi square analysis
Uses: To determine whether there is any association between 2 qualitative variables
A researcher wanted to test whether providing a vaccine makes a difference to the occurrence of pneumonia. He recruited two groups of people:
Group 1: Unvaccinated (control group), N = 92
Group 2: Vaccinated (intervention group), N = 92
There are 3 health outcomes that can be observed:
1) Contracted pneumococcal pneumonia
2) Contracted non-pneumococcal pneumonia
3) Did not contract pneumonia
The data collected are as below:
Health outcome                           Unvaccinated (N = 92)   Vaccinated (N = 92)
Contracted pneumococcal pneumonia        23                      5
Contracted non-pneumococcal pneumonia    8                       10
Did not contract pneumonia               61                      77
a. How to key in data?
Key in the data like this:
- Different columns hold different features.
- Label the data and make sure the 'Measure' setting is correct.
Data > Weight Cases
Note: Weight Cases is only for data that have been entered in table form (one row per cell, with a count column). If the data are fully spread out, with one row per person, the Weight Cases step can be omitted.
Analyze > Descriptive Statistics > Crosstabs…
- It is ok to interchange the row and column variables.
- At 'Cells…', request the expected counts.
- At 'Statistics…', request Chi-square and Phi and Cramer's V (both are used in the interpretation below).
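The two data layouts mentioned in the note hold the same information. As a rough illustration outside SPSS, the Python sketch below expands the summary table into one row per person and then counts the rows back into the table. The variable names (`table`, `cases`, `recovered`) are made up for this example.

```python
from collections import Counter

# Summary ("table form") layout: one entry per cell with its count.
# In SPSS this layout needs Data > Weight Cases on the count column.
table = {
    ("Pneumococcal pneumonia", "Unvaccinated"): 23,
    ("Pneumococcal pneumonia", "Vaccinated"): 5,
    ("Non-pneumococcal pneumonia", "Unvaccinated"): 8,
    ("Non-pneumococcal pneumonia", "Vaccinated"): 10,
    ("No pneumonia", "Unvaccinated"): 61,
    ("No pneumonia", "Vaccinated"): 77,
}

# Case-level ("spread out") layout: one row per person, no weighting needed.
cases = [cell for cell, count in table.items() for _ in range(count)]

# Counting the case-level rows recovers the summary table.
recovered = Counter(cases)

print(len(cases))
print(dict(recovered) == table)
```

Weighting the cases by the count column is what tells SPSS to treat the six rows of the first layout as the 184 people of the second.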
b. What to write / Interpretation:
Ho = There is no association between the health outcome and vaccination status
H1 = There is an association between the health outcome and vaccination status
- From the crosstabulation, we can see that no expected count is less than 5.
- This means the outcome of the test is valid.
- The degrees of freedom is 2 and the Pearson Chi-Square value is 13.649, which is quite large.
Note: The larger the Chi-square value, the more the data deviate from independence (meaning greater association).
- However, whether the Chi-Square value is large enough to be significant depends on the p-value.
- The p-value shown is 0.001, which is < 0.05.
- Thus, reject the null hypothesis and accept the alternative hypothesis.
- There is an association between the health outcome and vaccination status.
- Now that we know there is an association, we need to know whether it is strong or weak. Thus, we need to look at Cramer's V.
- The Cramer's V is 0.272, which is considered weak.
Note: Cramer's V ranges from 0 to 1 (< 0.4 = weak, 0.4 – 0.6 = moderate, > 0.6 = strong).
- Thus, there is a weak association between health outcome and vaccination status.
Conclusion: There is an association between the occurrence of pneumonia and vaccination status, but it is only a weak association.
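The numbers in this interpretation can be cross-checked by hand. Below is a minimal Python sketch (not SPSS output), assuming the standard formulas: expected count = row total × column total / N, and Cramer's V = sqrt(chi-square / (N × (smaller table dimension − 1))). The function name `chi_square_test` is made up for illustration, and the p-value shortcut exp(−chi-square / 2) is only valid because the degrees of freedom is 2 in this example.

```python
import math

# Observed counts: rows = health outcome, columns = (unvaccinated, vaccinated).
observed = [
    [23, 5],   # contracted pneumococcal pneumonia
    [8, 10],   # contracted non-pneumococcal pneumonia
    [61, 77],  # did not contract pneumonia
]

def chi_square_test(observed):
    """Return expected counts, Pearson chi-square, df and Cramer's V."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    expected = [[r * c / n for c in col_totals] for r in row_totals]
    chi2 = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(observed, expected)
        for o, e in zip(obs_row, exp_row)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    smaller_dim = min(len(row_totals), len(col_totals))
    cramers_v = math.sqrt(chi2 / (n * (smaller_dim - 1)))
    return expected, chi2, df, cramers_v

expected, chi2, df, cramers_v = chi_square_test(observed)

# Special case: with df = 2 the chi-square upper-tail probability is exp(-chi2 / 2).
p_value = math.exp(-chi2 / 2) if df == 2 else None

print(min(min(row) for row in expected))  # smallest expected count
print(round(chi2, 3), df, round(p_value, 3), round(cramers_v, 3))
```

For tables with other degrees of freedom the p-value needs a chi-square distribution function from a statistics package; the rest of the sketch is general.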
2. Bivariate data analysis: correlation and regression
Uses: To determine whether there is any association between 2 quantitative variables.
Before we go deep into the example, we must know that there are 6 assumptions that we must meet when doing regression analysis (a.k.a. the 6-item checklist):
1 Quantitative data
2 Linear relationship
3 No significant outliers
4 Independence of observations
5 Homoscedasticity
6 Residuals are normally distributed
A bank loan officer wanted to know the association between the income of his customers and the value of the cars they purchased. 20 customers were randomly picked, and their income and the value of the car bought are tabulated below. Income and price are in $'000.
Customer Income Price Customer Income Price

1 70 187 11 61 162
2 61 158 12 70 197
3 85 226 13 67 174
4 67 188 14 66 178
5 70 178 15 51 119
6 84 199 16 52 125
7 68 169 17 67 162
8 67 181 18 86 218
9 77 193 19 68 178
10 60 140 20 67 160
Objectives: To test the association and relationship between income and price
a. How to key in data?
1. Draw a scatter plot of the two variables. Here we must know how to define which variable goes on the y-axis and which on the x-axis:
- X-axis = independent variable / predictor variable
- Y-axis = dependent variable / outcome variable
From this example we know that income affects the price of the car bought.
So independent = income
Dependent = price
Note: In some situations there is no clear independent or dependent variable.
Eg: finding the association between the height and weight of a person
In this situation, we do not know which one affects the other, so it is ok to interchange the two variables.
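Outcome: The scatter plot is used to check assumption 2 (linear relationship) and assumption 3 (no significant outliers).
2. Analyze > Regression > Linear…
At 'Statistics…', request the output needed for the interpretation below (descriptive statistics, model fit and coefficient estimates).
Outcome: These data are important for us to see the relationship, make the equation and make predictions.
3. Save the residuals. This will give you a new column for the residuals.
- Important for the normality test (6th assumption)
Outcome:
- Explore > check the normality of the residuals (assumption 6: residuals are normally distributed)
4. At 'Plots…'
ZRESID:
- Standardized residuals
- Y-axis
ZPRED:
- Standardized predicted values
- X-axis
Outcome: The residual plot is used to check assumption 5 (homoscedasticity).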
How about the 1st and 4th assumptions?
- They can be checked simply by looking at the situation.
1. First, we can see that income and price are both specific numbers, which means they are quantitative: they have a numerical value (Note: not the same as ordinal data). This satisfies assumption 1 (quantitative data).
2. Next, the occurrence of one observation provides no information about the occurrence of another observation.
- Put simply, the income and price from customer 1 do not affect (have no relationship with) the income and price from customers 2, 3, 4, … 20. This satisfies assumption 4 (independence of observations).
b. What to write? / Interpretation
Ho = There is no association between income and the price of the car bought
H1 = There is an association between income and the price of the car bought
- From the scatter plot, we can see the dots are arranged in a straight-line pattern, thus there might be an association between price and income.
- It is a positive association (the line goes up).
- The higher the income of the customer, the higher the price of the car they bought.
- Visibly there is no major outlier.
Note:
- If the points are too scattered, it may indicate no association or a weak association.
- A negative association is also possible.
- The mean price of the cars bought by the 20 customers is $174.60K ± 27.10K (mean ± standard deviation).
- The mean income of the 20 customers is $68.20K ± 9.40K.
- The r value is 0.932, which is more than 0.8, thus there is a strong association between price and income.
Note:
- The r value ranges from -1 (perfect negative association) through 0 (no association) to +1 (perfect positive association). Judge the strength by its size, ignoring the sign:
- 0 – 0.3 = weak association
- 0.4 – 0.7 = moderate association
- 0.8 – 1 = strong association
Up to here, everything in part b explains the association. From here onward, everything in part b explains the relationship.
- The R² value shown is 0.868, which means 86.8% of the variation in Price is explained by Income.
Note:
R square = coefficient of determination
- The p-value is less than 0.001, which is also less than 0.05. Thus, reject the null hypothesis and accept the alternative hypothesis.
- There is an association between income and the price of the car.
Note:
When SPSS shows the p-value as 0.000, we do not report it like that; we report it as 'less than 0.001'.
The equation:
Price = -8.733 + 2.688(Income)
Y = C + mX, with intercept C = -8.733 and slope m = 2.688
- For every 1-unit ($1K) increase in income, the predicted price increases by $2.688K.
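The intercept, slope and R² can be recomputed with ordinary least squares. A minimal Python sketch follows, assuming the usual formulas slope = Sxy / Sxx and intercept = mean(Y) − slope × mean(X); the function name `fit_line` is made up for illustration and this is not how SPSS itself is implemented.

```python
# Income (X) and price (Y), both in $'000, for customers 1 to 20.
income = [70, 61, 85, 67, 70, 84, 68, 67, 77, 60,
          61, 70, 67, 66, 51, 52, 67, 86, 68, 67]
price = [187, 158, 226, 188, 178, 199, 169, 181, 193, 140,
         162, 197, 174, 178, 119, 125, 162, 218, 178, 160]

def fit_line(x, y):
    """Least-squares fit of y = intercept + slope * x; also returns R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)  # share of variation in y explained by x
    return intercept, slope, r_squared

intercept, slope, r_squared = fit_line(income, price)
print(round(intercept, 3), round(slope, 3), round(r_squared, 3))
```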
- In the residual plot, all the points fall within ± 3 and they are scattered randomly.
- Thus, homoscedasticity is assumed.
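The ± 3 check can also be done numerically. The Python sketch below assumes that a standardized residual is the raw residual divided by the standard error of the estimate, sqrt(SSE / (n − 2)); that definition is an assumption of this sketch, so the values may differ slightly from other kinds of residuals a package can produce. It only checks the range; the random scatter still has to be judged from the plot.

```python
import math

income = [70, 61, 85, 67, 70, 84, 68, 67, 77, 60,
          61, 70, 67, 66, 51, 52, 67, 86, 68, 67]
price = [187, 158, 226, 188, 178, 199, 169, 181, 193, 140,
         162, 197, 174, 178, 119, 125, 162, 218, 178, 160]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(price) / n

# Least-squares slope and intercept for price = intercept + slope * income.
sxx = sum((x - mean_x) ** 2 for x in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, price))
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Raw residuals: observed price minus predicted price.
residuals = [y - (intercept + slope * x) for x, y in zip(income, price)]

# Standard error of the estimate, then standardized residuals.
std_error = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))
z_residuals = [e / std_error for e in residuals]

largest = max(abs(z) for z in z_residuals)
print(round(largest, 2))
print(all(abs(z) <= 3 for z in z_residuals))
```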
Make prediction:
- The equation Price = -8.733 + 2.688(Income) can be used to predict the Price for a given Income.
- In the data
- Min income = 51
- Max income = 86
- Thus, predictions should only be made for incomes between 51 and 86.
- The average income is 68.20, so the closer the income value is to the average, the more reliable the prediction.
- Meanwhile, the closer the income is to 51 (min) or 86 (max), the less reliable the prediction.
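The prediction rule, including the restriction to the observed income range, can be written as a small function. This Python sketch uses the coefficients from the equation above; the name `predict_price` and the choice to raise an error outside the range are illustrative assumptions, not part of SPSS.

```python
# Coefficients from the fitted equation: Price = -8.733 + 2.688(Income).
INTERCEPT = -8.733
SLOPE = 2.688

# Observed income range in the sample ($'000).
MIN_INCOME = 51
MAX_INCOME = 86

def predict_price(income):
    """Predict car price ($'000) from income ($'000) within the observed range."""
    if not MIN_INCOME <= income <= MAX_INCOME:
        raise ValueError(
            f"income {income} is outside the observed range "
            f"{MIN_INCOME} to {MAX_INCOME}; prediction not supported"
        )
    return INTERCEPT + SLOPE * income

print(round(predict_price(70), 3))
```

For an income of 70, the predicted price works out to -8.733 + 2.688 × 70 = 179.427, that is, about $179.4K.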
- The sample size is 20 (< 50), thus the Shapiro-Wilk test is used.
- The p-value shown is 0.796, which is > 0.05.
- Thus, normality of the residuals is assumed (the 6th and final assumption).
Note:
If the residuals are not normal, the association test result might not be valid.
3. Extra
Extra 1: Chi-square value
Manual Chi-square calculation
$$\chi^2 = \frac{(\text{observed}_1 - \text{expected}_1)^2}{\text{expected}_1} + \frac{(\text{observed}_2 - \text{expected}_2)^2}{\text{expected}_2} + \dots$$
Note:
The larger the Chi-square value, the more the data deviate from independence (meaning greater association).
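As a worked instance, the formula can be applied to the vaccine data from section 1. This assumes each expected count is computed as row total × column total / grand total (the standard definition, which the guide itself does not spell out). The row totals are 28, 18 and 138, both column totals are 92, and N = 184.

```latex
% Expected counts (identical for both columns because the column totals are equal)
E_{\text{pneumococcal}} = \frac{28 \times 92}{184} = 14, \qquad
E_{\text{non-pneumococcal}} = \frac{18 \times 92}{184} = 9, \qquad
E_{\text{no pneumonia}} = \frac{138 \times 92}{184} = 69

% Sum over all six cells
\chi^2 = \frac{(23-14)^2}{14} + \frac{(5-14)^2}{14}
       + \frac{(8-9)^2}{9} + \frac{(10-9)^2}{9}
       + \frac{(61-69)^2}{69} + \frac{(77-69)^2}{69}
       = \frac{81}{7} + \frac{2}{9} + \frac{128}{69} \approx 13.649
```

This agrees with the Pearson Chi-Square value of 13.649 reported in section 1.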
Extra 2: Expected count
To be noted: no expected count should be less than 5, so that the result will be valid.
But what to do if an expected count is < 5?
Example 1: When it is possible to 'pool' or 'collapse' the data into fewer groups
A researcher wanted to test the association between fitness rating and trouble falling asleep within a community. A cross-sectional study was carried out. The recruited participants were asked to rate their fitness on a scale from 1 to 10 (ordinal) and to answer a 'yes' or 'no' question about trouble falling asleep (nominal).
Eg:
1. Overall, how would you rate your physical fitness? (ordinal)
1 2 3 4 5 6 7 8 9 10
2. Do you have trouble falling asleep? (nominal)
(yes / no)
Below is part of the data collected:
After we do the crosstab, here are the outcomes:
- We find that there are 7 cells with an expected count of less than 5.
- The result is not valid!
- However, it is possible to 'pool' the data together! (ordinal data)
How to 'pool' the data?
Transform > Recode into Different Variables
- It is important to use 'different variables', so that it gives you a new column of recoded values (you will see it later).
- In 'Old and New Values', recode or regroup the values into appropriate categories.
Note:
- You cannot always form groups that produce expected counts of at least 5 on the first attempt.
- Thus, a trial-and-error method should be used (if you fail to produce a valid result, pool again into fewer groups with a broader range each time).
Outcome:
- You will get a new column with the newly recoded fitness rating as 1, 2, 3 (just as we regrouped them).
- We can rename them, so it will be easier when we want to report them.
Ok, after we have done the recoding, let's move on to the main business. The crosstabs!
- Yes! There is no expected count that is less than 5.
- The result will be valid!
- We can use this for the rest of the test.
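The pooling idea can be sketched in Python as well. Everything specific here is an assumption for illustration: the guide's actual data and cut points are not shown, so the counts in `toy_counts` are invented, the cut points (1–4, 5–7, 8–10) are hypothetical, and the function names are made up. The sketch recodes the 10-point rating into 3 groups and counts how many cells have an expected count below 5 before and after pooling.

```python
# Made-up counts for illustration only: rating -> (yes, no) for "trouble falling asleep".
toy_counts = {
    1: (2, 0), 2: (4, 2), 3: (4, 4), 4: (6, 6), 5: (8, 12),
    6: (6, 14), 7: (4, 16), 8: (2, 12), 9: (2, 8), 10: (0, 6),
}

def recode_fitness(rating):
    """Hypothetical cut points: 1-4 -> group 1, 5-7 -> group 2, 8-10 -> group 3."""
    if 1 <= rating <= 4:
        return 1
    if 5 <= rating <= 7:
        return 2
    if 8 <= rating <= 10:
        return 3
    raise ValueError(f"rating {rating} is outside the 1 to 10 scale")

def pool(counts):
    """Add up the (yes, no) counts of all ratings that fall into the same group."""
    pooled = {}
    for rating, (yes, no) in counts.items():
        group = recode_fitness(rating)
        old_yes, old_no = pooled.get(group, (0, 0))
        pooled[group] = (old_yes + yes, old_no + no)
    return pooled

def small_expected_cells(counts, threshold=5):
    """Count cells whose expected count (row total x column total / N) is below threshold."""
    rows = list(counts.values())
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    return sum(
        1 for r in row_totals for c in col_totals if r * c / n < threshold
    )

pooled_counts = pool(toy_counts)
before = small_expected_cells(toy_counts)
after = small_expected_cells(pooled_counts)
print(pooled_counts)
print(before, after)
```

If `after` were still above zero, the trial-and-error advice above applies: try broader groups and check again.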
Example 2: What if it is impossible to 'pool' the data?
A researcher wanted to test whether there is any association between personality and colour preference. A few students were recruited and a cross-sectional study was carried out. The students were required to label themselves as 'introvert' or 'extrovert' and to choose 1 of 4 colour preferences (red, yellow, green or blue).
The data collected are as below:
Student   Personality   Colour preference
1         Introvert     Yellow
2         Introvert     Red
3         Introvert     Yellow
4         Introvert     Green
5         Extrovert     Blue
6         Extrovert     Red
…         …             …
After that we do the crosstabs and we find:
- There are 2 cells with an expected count of less than 5.
- The results will not be valid.
- However, we cannot recode the data.
Why can we not recode?
- Because colours are nominal categories with no order: we cannot meaningfully recode yellow and red into one group!
Thus, the only way to solve this is by INCREASING the sample size. Recruit more people!
