
SPSS Basic Guidance

3
Crosstab & chi-square analysis AND Bivariate
data analysis: Correlation and regression

Content:
1. Crosstab & chi-square analysis
2. Bivariate data analysis: Correlation and
regression
3. Extra
Extra 1: Chi-square value
Extra 2: Expected count

Disclaimer: This is a student’s work. If you find any mistakes, please contact the author.
1. Crosstab and chi square analysis
Uses: To determine whether there is any association between 2 qualitative variables

A researcher wanted to test whether providing a vaccine makes a difference to
the occurrence of pneumonia. He recruited two groups of people:
Group 1: Unvaccinated (control group), N = 92
Group 2: Vaccinated (intervention group), N = 92

There are 3 health outcomes that can be observed:

1) Contracted pneumococcal pneumonia
2) Contracted non-pneumococcal pneumonia
3) Did not contract pneumonia

The data collected are as below:

Health outcome                           Unvaccinated (N = 92)   Vaccinated (N = 92)
Contracted pneumococcal pneumonia        23                      5
Contracted non-pneumococcal pneumonia    8                       10
Did not contract pneumonia               61                      77

a. How to key in data?

Key in the data like this:
- Different columns for different features (here: health outcome, vaccination status and the count in each cell)

Label the data and make sure the ‘Measure’ part is correct.

Data > Weight Cases…

Note: Weight Cases is only for data that have been entered in table (summary count) form.
→ If the data are a full spread of values (one row per person), there is no need to weight cases (I’ll show it later).

Analyze > Descriptive Statistics > Crosstabs…
→ It is OK to interchange the row and column variables.

At ‘Cells…’: tick ‘Expected’ (under Counts).
At ‘Statistics…’: tick ‘Chi-square’ and ‘Phi and Cramer’s V’.

Note:
If the data are spread out case by case, with one row per person, the Weight Cases step can be omitted.
b. What to write / Interpretation:

Ho = There is no association between health outcome and vaccination status
H1 = There is an association between health outcome and vaccination status

→ From the crosstabulation, we can see that no expected count is less than 5
→ This means the outcome of the test is valid

→ The degrees of freedom are 2 and the Pearson Chi-Square value is 13.649, which is quite large
Note: the larger the Chi-Square value, the more the data deviate from independence (suggesting a greater association)
→ However, whether the Chi-Square value counts as significantly large depends on the p-value
→ The p-value shown is 0.001, which is < 0.05
→ Thus, reject the null hypothesis and accept the alternative hypothesis.
→ There is an association between health outcome and vaccination status

→ Now that we know there is an association, we need to know whether it is strong or weak. For this, we look at Cramer’s V.

→ Cramer’s V is 0.272, which is considered small
Note: Cramer’s V ranges from 0 to 1 (<0.4 = weak, 0.4 – 0.6 = moderate, >0.6 = strong)
→ Thus, there is a weak association between health outcome and vaccination status.

Conclusion: There is an association between the occurrence of pneumonia and vaccination status, but it is only a weak association.
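The p-value and Cramer’s V reported above can be cross-checked by hand. The sketch below is only an illustration in Python (it is not SPSS output). It starts from the reported Chi-Square value of 13.649 and uses the standard formula V = sqrt(chi-square / (N × min(rows − 1, columns − 1))). The shortcut used for the p-value, exp(−chi-square / 2), only holds when the degrees of freedom equal 2, as they do here.

```python
import math

# Values taken from the example above
chi_square = 13.649
n_total = 184            # 92 unvaccinated + 92 vaccinated
n_rows, n_cols = 3, 2    # 3 health outcomes x 2 vaccination groups

# Degrees of freedom for a crosstab: (rows - 1) * (columns - 1)
df = (n_rows - 1) * (n_cols - 1)

# For df = 2 ONLY, the upper-tail chi-square probability is exp(-x / 2)
p_value = math.exp(-chi_square / 2)

# Cramer's V = sqrt(chi-square / (N * min(rows - 1, columns - 1)))
cramers_v = math.sqrt(chi_square / (n_total * min(n_rows - 1, n_cols - 1)))

print(df, round(p_value, 3), round(cramers_v, 3))  # 2 0.001 0.272
```

Both values agree with the figures read from the output (p = 0.001, Cramer’s V = 0.272).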

2. Bivariate data analysis: correlation and regression


Uses: To determine whether there is any association between 2 quantitative
variables.

Before we go deeper into the example, note that there are 6 assumptions that must
be met when doing regression analysis (a.k.a. the 6-point checklist):

1 Quantitative data
2 Linear relationship
3 No significant outliers
4 Independence of observations
5 Homoscedasticity
6 Residuals are normally distributed

A bank loan officer wanted to know the association between income and the value of the car
purchased by his customers. 20 customers were randomly picked, and their income and the
value of the car they bought were tabulated as below. Income and price are in $’000.

Customer  Income  Price    Customer  Income  Price
1         70      187      11        61      162
2         61      158      12        70      197
3         85      226      13        67      174
4         67      188      14        66      178
5         70      178      15        51      119
6         84      199      16        52      125
7         68      169      17        67      162
8         67      181      18        86      218
9         77      193      19        68      178
10        60      140      20        67      160

Objectives: To test the association and relationship between income and price

a. How to key in data?

1. Here (for the scatter plot) we must define which variable is on the y-axis and which is on the x-axis
- X-axis = independent variable / predictor variable
- Y-axis = dependent variable / outcome variable

From this example, we know that income affects the price of the car that the customer bought.

So independent = income
Dependent = price

Note: in some situations there is no clear independent or dependent variable.

E.g.: finding the association between the height and weight of a person

In this situation, we do not know which one affects the other, so it is OK to interchange the two variables.
Outcome: the scatter plot, used to check
- Assumption 2: Linear relationship
- Assumption 3: No significant outliers

2. Analyze > Regression > Linear…

At ‘Statistics…’

Outcome:
These tables are important for seeing the relationship, writing the equation and making predictions.

3. Save the residuals. This will give you a new column of residuals
- Important for the normality test (6th assumption)

Outcome:
- Explore > check the normality of the residuals
- Assumption 6: Residuals are normally distributed

4. At ‘Plots…’
ZRESID:
- Standardized residuals
- Y-axis
ZPRED:
- Standardized predicted values
- X-axis

Outcome: the residual plot, used to check
- Assumption 5: Homoscedasticity


How about the 1st and 4th assumptions?
→ They can be checked just by looking at the situation

1. First, income and price are both specific numbers, which means they are quantitative: the numbers carry actual values (Note: not the same as ordinal data)
Assumption 1: Quantitative data

2. Next, the occurrence of one observation provides no information about the occurrence of any other observation.
→ Put simply, the income and price of customer 1 do not affect (have no relationship with) the income and price of customers 2, 3, 4 … 20
Assumption 4: Independence of observations
b. What to write? / Interpretation
Ho = There is no association between income and price of car bought
H1 = There is an association between income and price of car bought

→ From the scatter plot, we can see the dots are arranged in a straight-line pattern, thus there might be an association between price and income
→ It is a positive association (the line goes up)
→ The higher the income of the customer, the higher the price of the car they bought
→ Visibly, there is no major outlier
Note:
→ If the points are too scattered, it may indicate no association or a weak association
→ A negative association is also possible

→ The mean price of the cars bought by the 20 customers is $174.60K ± 27.10K (mean ± standard deviation)
→ The mean income of the 20 customers is $68.20K ± 9.40K

→ The r value is 0.932, which is more than 0.8, thus there is a strong association between price and income.

Note:
→ The r value ranges from -1 (perfect negative association) through 0 (no linear association) to 1 (perfect positive association)
→ For the size of r, ignoring the sign:
→ 0 – 0.3 = weak association
→ 0.4 – 0.7 = moderate association
→ 0.8 – 1 = strong association
Up to here, everything above in section 2 explains the association.
From here onward, everything below in section 2 explains the relationship.

→ The R² value shown is 0.868, which means 86.8% of the variation in Price is explained by Income.

Note:
R square = coefficient of determination

→ The p-value is less than 0.001, which is also less than 0.05. Thus, reject the null hypothesis and accept the alternative hypothesis.
→ There is an association between income and the price of the car.

Note:
When the p-value is shown as 0.000, we do not report it like that; we report it as ‘less than 0.001’.
The equation:

Price = -8.733 + 2.688(Income)

Y = mX + C, where m = 2.688 (slope) and C = -8.733 (intercept)

→ For every 1-unit ($1K) increase in income, the predicted price increases by $2.688K.

→ In the residual plot, all the points fall within ± 3 and are scattered randomly
→ Thus, homoscedasticity is assumed
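The equation, the R² value and the ± 3 check on the residuals can also be worked out by hand with ordinary least squares. The sketch below is plain Python, not SPSS output. It assumes that a standardized residual is the raw residual divided by the standard error of the estimate, sqrt(SSE / (n − 2)), and the variable names are illustrative only.

```python
import math

# Data from the customer table (both in $'000), customers 1 to 20
income = [70, 61, 85, 67, 70, 84, 68, 67, 77, 60,
          61, 70, 67, 66, 51, 52, 67, 86, 68, 67]
price = [187, 158, 226, 188, 178, 199, 169, 181, 193, 140,
         162, 197, 174, 178, 119, 125, 162, 218, 178, 160]

n = len(income)
mean_x = sum(income) / n
mean_y = sum(price) / n
sxx = sum((x - mean_x) ** 2 for x in income)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(income, price))

# Least-squares line: Price = intercept + slope * Income
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Residual = observed price - predicted price
residuals = [y - (intercept + slope * x) for x, y in zip(income, price)]
sse = sum(e ** 2 for e in residuals)           # unexplained variation
sst = sum((y - mean_y) ** 2 for y in price)    # total variation
r_squared = 1 - sse / sst

# Standardized residuals (assumed definition: residual / standard error of estimate)
std_error = math.sqrt(sse / (n - 2))
z_resid = [e / std_error for e in residuals]

print(round(intercept, 3), round(slope, 3))    # -8.733 2.688
print(round(r_squared, 3))                     # 0.868
print(max(abs(z) for z in z_resid) < 3)        # True
```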

Make prediction:
→ The equation Price = -8.733 + 2.688(Income) can be used to predict the Price given the Income
→ In the data
- Min income = 51
- Max income = 86

→ Thus, predictions can only be made for incomes between 51 and 86.

→ The average income is 68.20, so the closer the income value is to the average, the more reliable the prediction
→ Meanwhile, the closer the income value is to 51 (min) or 86 (max), the less reliable the prediction
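The prediction rule above can be written as a small helper that refuses to predict outside the observed income range. This is a hypothetical sketch: the function name `predict_price` and the choice to raise an error are illustrative assumptions, while the coefficients and the 51 to 86 range come from the example.

```python
# Coefficients and observed range from the example (all in $'000)
INTERCEPT = -8.733
SLOPE = 2.688
MIN_INCOME, MAX_INCOME = 51, 86


def predict_price(income):
    """Predict car price ($'000) from income ($'000) within the observed range only."""
    if not MIN_INCOME <= income <= MAX_INCOME:
        # Outside 51-86 the line has not been checked against any data
        raise ValueError("income is outside the observed range 51-86")
    return INTERCEPT + SLOPE * income


print(round(predict_price(70), 3))  # 179.427
```

For example, a customer with an income of $70K has a predicted car price of about $179.4K, while an income of $100K is rejected because it lies outside the data.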

→ The sample size is 20 (< 50), thus the Shapiro-Wilk test is used
→ The p-value shown is 0.796, which is > 0.05
→ Thus, normality of the residuals is assumed (the final assumption)
Note:
If the residuals are not normal, the association test result might not be valid.
3. Extra
Extra 1: Chi-square value

Manual Chi-square calculation


$$\chi^2 = \frac{(\text{observed}_1 - \text{expected}_1)^2}{\text{expected}_1} + \frac{(\text{observed}_2 - \text{expected}_2)^2}{\text{expected}_2} + \dots$$

Note:
The larger the Chi-square value, the more the data deviate from independence
(suggesting a greater association)
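As an illustration, the manual calculation can be applied to the pneumonia table from section 1. The sketch below is plain Python, not SPSS output. It assumes the usual rule that the expected count of a cell is row total × column total / grand total, and it sums the formula over all 6 cells.

```python
# Observed counts: rows = health outcomes, columns = (unvaccinated, vaccinated)
observed = [
    [23, 5],    # contracted pneumococcal pneumonia
    [8, 10],    # contracted non-pneumococcal pneumonia
    [61, 77],   # did not contract pneumonia
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected count under independence = row total * column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

# Chi-square = sum over all cells of (observed - expected)^2 / expected
chi_square = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)

print(expected)              # [[14.0, 14.0], [9.0, 9.0], [69.0, 69.0]]
print(round(chi_square, 3))  # 13.649
```

The result matches the Pearson Chi-Square value of 13.649 from the crosstab output, and the smallest expected count is 9.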

Extra 2: Expected count

To be noted: all expected counts must be at least 5 so that the result will be valid.

But what should we do if an expected count is < 5?

Example 1: When it is possible to ‘pool’ or ‘collapse’ the data into fewer groups

A researcher wanted to test the association between fitness rating and trouble falling
asleep within a community. A cross-sectional study was carried out. The participants
were asked to rate their fitness on a scale from 1 to 10 (ordinal) and to answer a ‘yes’ or ‘no’
question on trouble falling asleep (nominal)

E.g.:
1. Overall, how would you rate your physical fitness? (ordinal)
1 2 3 4 5 6 7 8 9 10

2. Do you have trouble falling asleep? (nominal)
(yes / no)

Below is part of the data collected:

After we do the crosstab, here are the outcomes:

→ We find that there are 7 cells with an expected count of less than 5
→ The result is not valid!
→ However, it is possible to ‘pool’ the data together! (ordinal data)

How to ‘pool’ the data?



Transform > Recode into Different Variables
- Important to use ‘different variables’, so that it gives you a new column of recoded values (you will see it later)

- In ‘Old and New Values’
- Recode or regroup the values into appropriate categories
Note:
- You cannot always form groups that produce valid expected counts on the first attempt
- Thus, use trial and error (pool again into fewer groups with a broader range each time if you fail to produce a valid result)

Outcome: You will get a new column with the newly recoded fitness rating as 1, 2, 3 (just as we regrouped them).

We can rename them, so it will be easier when we want to report them.
OK, now that the recoding is done, let’s move on to the main business. The crosstabs!
→ Yes! There is no expected count that is less than 5
→ The result will be valid!
→ We can use this for the rest of the test

Example 2: What if it is impossible to ‘pool’ the data?

A researcher wanted to test whether there is any association between personality and colour
preference. A few students were recruited and a cross-sectional study was carried out. The
students were required to label themselves as ‘introvert’ or ‘extrovert’ and to choose 1 of
4 colour preferences (red, yellow, green or blue).
The data collected are as below:

Student  Personality  Colour preference
1        Introvert    Yellow
2        Introvert    Red
3        Introvert    Yellow
4        Introvert    Green
5        Extrovert    Blue
6        Extrovert    Red
…        …            …

After that, we do the crosstabs and find:

→ There are 2 cells with an expected count of less than 5
→ The results will not be valid
→ However, we cannot recode the data

Why can we not recode?

- Because colours are nominal categories with no natural order, there is no meaningful way to combine, say, yellow and red into one group.

Thus, the only way to solve this is by INCREASING the sample size. Recruit more people!
