SPSS BRM
SPSS/Statistical Component
DRAFT v0.10-0
Table of contents
Preface .................................................................................................................................................................................... 4
1 Introduction: SPSS........................................................................................................................................................... 5
1.1 SPSS case studies: A great way to learn! ................................................................................................................ 6
2 Introduction: Definitions................................................................................................................................................. 8
3 Introduction: Business “Reality” versus Experimental Quality ..................................................................................... 11
4 Survey Data ................................................................................................................................................................... 11
4.1 Coding Survey Data ............................................................................................................................................... 11
4.1.1 Codebook ...................................................................................................................................................... 11
4.1.2 Coding Open Text Response Items ............................................................................................................... 12
4.1.3 Coding Multiple Response (MR) Items ......................................................................................................... 13
4.1.4 Measure Types: Nominal, Ordinal, Scale ...................................................................................................... 15
4.1.5 Entering Data in SPSS: Variable View............................................................................................................ 16
4.1.6 Coding Missing Values .................................................................................................................................. 19
4.1.7 Entering Data in SPSS: Data View ................................................................................................................. 21
4.1.8 Multiple Response Items in SPSS .................................................................................................................. 21
4.1.8.1 Multiple Response: SPSS Variable Sets ..................................................................................................... 22
4.1.8.2 Multiple Response: SPSS Multiple Response (MR) Sets ........................................................................... 24
4.2 Analysis of Survey Data ......................................................................................................................................... 30
5 Statistical Experimentation Process: Overview ............................................................................................................ 32
6 Statistical Experimentation Process: Example .............................................................................................................. 33
7 SPSS: Selecting cases, and Splitting output .................................................................................................................. 34
7.1 Select Cases........................................................................................................................................................... 34
7.2 Split File ................................................................................................................................................................. 36
8 Statistical Summaries: Descriptive Statistics ................................................................................................................ 39
8.1 Frequency Tables .................................................................................................................................................. 39
8.2 Statistics ................................................................................................................................................................ 44
8.2.1 Middle/typical value: Mean, Median, Mode ................................................................................................ 46
8.2.2 Spread/variability/dispersion: Standard deviation, Quartile deviation, Percentile Ranges ......................... 46
8.2.3 Spread/variability/dispersion of a statistic! The key idea behind inferential statistics! .............................. 47
8.2.4 Meaningless Statistics: Just because you can, doesn’t mean you should! ................................................... 50
Page 1 of 102
8.3 Charts .................................................................................................................................................................... 51
8.3.1 Display or Comparison of Counts/Frequencies: Bar charts, Pie charts, Histogram ...................................... 51
8.3.2 Display or Comparison of Proportions from Counts/Frequencies: Pie charts, Percentage bar charts ........ 56
8.3.3 Display or comparison of (Scale) Values: Histogram, Line diagram, Error Bars ........................................... 56
8.3.4 Display of relationships between two values: xy-Scatter ............................................................................. 60
8.3.5 Editing Charts ................................................................................................................................................ 60
9 Statistical Tests: Inferential Statistics ........................................................................................................................... 62
9.1 Sampling Distributions: The Background to Hypothesis Testing & Inferential Statistics ..................................... 62
9.2 Hypotheses, Type I and Type II Errors, Power, Effect Size .................................................................................... 63
9.2.1 Type I Errors .................................................................................................................................................. 65
9.2.2 Type II Errors and Power............................................................................................................................... 66
9.2.3 Effect Size ...................................................................................................................................................... 67
9.3 Means ................................................................................................................................................................... 67
9.3.1 Central Limit Theorem and More on Sampling Distributions ....................................................................... 67
9.3.2 The t distribution and t tests ........................................................................................................................ 71
9.3.3 Single Mean: Mean that differs from given value ........................................................................................ 71
9.3.4 Two (Independent) Groups: Means that differ from each other ................................................................. 74
9.3.5 Paired (Dependent) Differences: Means that differ within subjects across 2 times .................................... 77
9.3.6 Three or more (Independent) Groups: Means that differ from each other with 3 or more groups: ANOVA ............. 80
9.3.7 Three or more (Dependent) Groups: Means that differ from each other with 1 group across 3 or more times: ANOVA ............. 80
9.4 Proportions ........................................................................................................................................................... 80
9.4.1 Single Proportion: Proportion that differs from given value ........................................................................ 81
9.4.2 Two (Independent) Groups: Proportions that differ from each other ......................................................... 85
9.4.3 Paired (Dependent) Differences: Proportions that differ within subjects across 2 times ............................ 85
9.5 Crosstabs and Frequencies: Chi-Squared tests ..................................................................................................... 85
9.5.1 Chi squared independence testing ............................................................................................................... 86
9.5.2 Chi squared in Excel ...................................................................................................................................... 92
9.5.3 Chi-squared goodness of fit testing .............................................................................................................. 93
9.5.3.1 Normal distribution test ........................................................................................................................... 93
9.6 Relationships: Regression and Correlation: An overview ..................................................................................... 97
9.6.1 Residual Analysis ........................................................................................................................................... 98
9.6.2 Correlation Coefficient.................................................................................................................................. 98
9.6.3 Multivariate Regression ................................................................................................................................ 99
10 Reading and Writing Statistical Results .................................................................................................................. 100
11 Further Study .......................................................................................................................................................... 100
12 References .............................................................................................................................................................. 101
13 Change History ........................................................................................................................................................ 102
Preface
This document is a work in progress, as the “DRAFT” at the start indicates. Feel free to e-mail me if you spot any
mistakes or feel something could/should be improved upon. This document is targeted at business students taking an
SPSS/statistics component of some sort. Much of the detail will be useful for anyone starting off with SPSS/statistics,
who has successfully completed a previous course in basic statistics.
Where the results of a statistical analysis may be of critical importance, I’d suggest getting a professional
statistician involved!! This document is only an introduction to SPSS and statistics!!
Nov/Dec-2012
© Colm McGuinness
Business Research Methods - SPSS/Statistical Component
1 Introduction: SPSS
SPSS is a powerful statistical software package, whose origins lie in the social sciences. It is capable of anything from
“simple” summaries of data using summary statistics, tables and charts, to advanced general linear model and data
mining techniques.
This document was written with V20 in mind. Screen shots are generally from V20, with the odd one from V19 or from a
trial of V21.
There is A LOT of help available within SPSS. Most dialogs have a help button, with generally detailed information and
examples available behind these help buttons. There is also a Tutorial, which is available from the application’s opening
screen or from the Help/Tutorial menu path.
There are also Case Studies and a Statistics Coach, available from the Help menu. The Statistics Coach starting screen is:
Actual functionality available in SPSS depends on the license. Individual modules have potentially separate licensing
requirements, depending on the “base package” chosen. Some “base packages” come with a number of add-on
modules, and some don’t! If you notice options missing from menu paths or from particular dialogs, then it could be
down to a license restriction.
If the immediate help (or the expanded tree of help under the option you land on) isn’t sufficient, then note which
module of SPSS the help brought you to above (Statistics Base here) … and use this to find a corresponding case study
(details below).
Navigate now to Help/Case Studies, and then into the case study section that corresponds to the SPSS module you
found above:
Here … the case studies are divided up into modules, so expand the module
tree corresponding to the SPSS module found from the direct help above:
“Statistics Base” here.
Now within this you will generally be able to find a case study of the topic you
are interested in, assuming it is a high level (ie important) command!
You can then click through details of a step by step analysis, with
comments/conclusions, and follow/repeat the steps yourself to see exactly
how it works. Very useful …
2 Introduction: Definitions
There are lots of terms whose definitions we might benefit from knowing. I’ve included some of the main ones below.
In reality I often double check on Wikipedia if I am in doubt!! I have generally found the statistical information there to
be quite good/useful.
Terms that are of particular importance to know + understand prior to conducting statistical analyses are the following:
- Null hypothesis
- P-value
- Significance
- Alternative hypothesis
- Confidence interval
- Outlier
- Influence
- Normal
Term Description
Alternative Hypothesis (H1) See Null Hypothesis first:
This is the alternative statement to the “null hypothesis”, and is often (but not always!)
the one we are interested in proving likely/true.
Central Limit Theorem (CLT) This theorem asserts that means (or in general “sums”) calculated from samples of
sufficient size (n>30?) will be approximately normally distributed. This is VERY useful
since it allows us to assume normality, for example of the sample means, with no
restriction on, or even knowledge of, the population data distribution!
Confidence Interval A range of values within which we are confident, at a certain specified level, that a
population parameter/statistic will lie. For example, a 95% confidence interval for a population
mean consists of a lower and upper limit value, derived from a sample, and we can be
95% confident that the population mean will lie within this range¹.
Exploratory Data Analysis (EDA) This is when we investigate a set of data, in an ad-hoc fashion, in an attempt to gain
insights about and from the data. Typically we might use statistics, tables and charts as
investigative tools. Might also be seen as part of inductive research, where we explore
the data to generate a hypothesis or theory: This could later be tested against a new set
of data, from a new experiment.
Hypothesis A testable statement, eg Drug X reduces total cholesterol by more than the current
standard treatment, drug Y.
Influence A data point is said to be influential if omitting it makes a significant difference to the
result(s). Clearly a result that is significantly changed by a change in one (or a small
number) of data point(s) is not likely to be reliable.
Normal A very commonly occurring distribution in statistics is the “normal distribution”. It is the
classic “bell curve”:
Figure: On the left is the histogram we might get with just 5 groups/bins, summarising a
relatively small amount of collected sample data. On the right is what we would get, if
the data were normally distributed, and we could gather sufficient data to “fill” many
bins. The normal curve shape is shown by the dashed line.
Many statistical tests assume normality either of the data, or of the distribution of the
statistic being tested, such as testing for differences in means. The central limit theorem
is very useful in this respect.
Null Hypothesis (H0) This is often the “nothing new going on here” hypothesis. It is the counterpart of the
alternative hypothesis, which is often (but not always!) the one we are actually most
interested in. The null hypothesis is often referred to in statistical packages as H0. See
also p-value below.
Outlier A point that we believe or can measure to be “extreme” in some way, large or small.
Such points often require special attention as part of statistical analysis.
p-value The probability of obtaining a statistic at least as extreme as the one calculated from
your data, assuming the null hypothesis, H0, is true. A low p-value is evidence against
H0. Note that it is not the probability that H0 is true, though it is often loosely read
that way!
Reliability This term has various levels of definition, but two important levels of reliability are:
- Internal reliability: The agreement within a questionnaire, or set of data, of
points that address the same topic.
- External, or test-retest, reliability: The agreement of results across a repeat of
the entire experiment.
See http://en.wikipedia.org/wiki/Reliability_(statistics), for example.
Robust Robust statistics are statistical methods that are not heavily influenced by small
departures from statistical assumptions.
Sampling Distribution Just like data being distributed in some way, such as shown for the “normal” term above,
sample statistics also have distributions. For example, we take a sample and calculate
some statistic, say the mean (though the same applies to any statistic!). We take a fresh
sample and calculate the statistic again, and then again, and again, etc. All of these
attempts will produce results that differ somewhat. This is just a result of random
variation amongst the samples. If we were to plot a histogram of the calculated statistics
we would see the “sampling distribution” for that statistic.
Because of the central limit theorem (CLT) we know that the sampling distribution for
the mean is (approximately) normal. It is also known that the sampling distribution of
differences between means is normal. Many other statistics have known distributions,
all of which is then used by statistical packages as part of statistical testing.
¹ This is not quite technically correct, but it is a fairly decent working understanding.
Significance Another term used to describe or interpret a p-value. Statistical significance (typically
seen from a low p-value, with “low” meaning below 0.05, for example) is separate from
practical significance! A new drug might be statistically significantly better than some
standard treatment, but if it is only marginally better, and is more expensive or has more
side effects, then its practical significance might be “not significant”.
Statistical significance seeks to determine whether there is a result that is not explainable
by random chance alone.
Standard Error (SE) Statistics, such as the difference in two means, are subject to variation if we were to take
additional samples. This variation can be the result of just random differences between
samples. The standard error for a statistic is a measure of how much variation we would
expect for that statistic. It is in fact the standard deviation of the statistic.
In SPSS if you were to test for a difference in means from two independent samples, you
will see headings such as “Std. Error Difference” in the result output. This is the
“standard error of the difference”.
Validity Like reliability above, this term has various levels of definition, but at a high level
represents the agreement of experiment and “reality” (if we could access it!!). For
example, does a questionnaire designed to measure intelligence actually measure
“intelligence” or is it measuring something else? Intelligence tests can be culturally
biased, for example, and hence might not in fact be measuring “intelligence” but rather a
“cultural exposure”. Actually measuring “intelligence” is not simple, and may not even
be possible!! (IMHO)
See http://en.wikipedia.org/wiki/Validity_(statistics), for example.
Variance This is a statistic that measures the spread/variability within data. The standard
deviation is the square root of the variance.
Table 1: Important and commonly used statistical terms.
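The sampling distribution and CLT ideas in Table 1 can be demonstrated outside SPSS. The following Python sketch (standard library only; the population is made up and deliberately non-normal) draws repeated samples and shows that the sample means cluster around the population mean with a spread (the standard error) of roughly sigma divided by the square root of n, just as the CLT asserts:

```python
import random
import statistics

random.seed(42)

# Made-up, deliberately non-normal population: uniform on [0, 100].
# Its mean is about 50 and its standard deviation about 28.9.
population = [random.uniform(0, 100) for _ in range(20_000)]

def sample_means(sample_size, n_samples=1000):
    """Draw n_samples fresh samples and return the mean of each one."""
    return [statistics.mean(random.sample(population, sample_size))
            for _ in range(n_samples)]

means = sample_means(sample_size=40)

# The means cluster around 50, and their standard deviation (the standard
# error of the mean) is close to 28.9 / sqrt(40), i.e. roughly 4.6, even
# though the underlying data are not remotely normal.
print(round(statistics.mean(means), 2))
print(round(statistics.stdev(means), 2))
```

Plotting a histogram of `means` reproduces the bell shape described under “Normal” above, which is exactly what statistical packages rely on when testing means.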
3 Introduction: Business “Reality” versus Experimental Quality
In reality this gold standard is unlikely to be appropriate for many business applications where time + money are
important limiting factors. However, where compromise is applied it is important to fully appreciate that results
obtained may have flaws of some sort. In addition, while flawed research might randomly not harm a business today, it
might do next time!! Under the pressure of “getting results” (quickly/cheaply), it is important to know what weight to
place on such results, and what flaws might be present as a result of “getting results”!!
Pragmatism and compromise are important in life … but always be aware of what was lost or left out! Don’t mistake
pragmatic reality for actual reality!!
4 Survey Data
Details for questionnaire creation and sampling techniques are not included here as they are covered elsewhere.
4.1 Coding Survey Data
After your questionnaire is designed you should then decide on how every possible response, including “No response”
(ie missing values) will be coded.
4.1.1 Codebook
A codebook is a document that specifies the details for each variable in SPSS: how each variable relates to
questionnaire question/response codes, and how missing values will be handled/coded.
If there is a standard codebook available then this should be used so that your analyses can be compared with any
previous analyses. If you were to code Male as 0, and Female as 1, whereas the codebook coded Male as 1 and Female
as 0, this could cause confusion if terms from certain analyses were compared. It does depend on the analysis. For some
analysis it makes no difference! But it is better to stick to a standard coding, if one is available. It can save a lot of work
and re-working in the longer term.
An example entry from a codebook for the gender survey question might be:
Questionnaire Name | Description | SPSS Variable | Responses | Missing Value | SPSS Measurement type
Gender | Gender of the respondent | gender | 0 = Male; 1 = Female | 9 | Nominal
SPSS Analyze/Reports/Codebook can be used to produce a codebook listing for selected variables. For example,
including the gender and level of education variables from the telco.sav sample file, and default output/statistics,
produces:
gender
Standard Attributes: Position 10; Label Gender; Type Numeric; Format F4; Measurement Nominal; Role Input
Valid Values: 0 = Male (483, 48.3%); 1 = Female (517, 51.7%)

ed
Standard Attributes: Position 7; Label Level of education; Format F4; Measurement Ordinal; Role Input
Valid Values: 1 = Did not complete high school (204, 20.4%); 2 = High school degree (287, 28.7%); 3 = Some college (209, 20.9%); 4 = College degree (234, 23.4%); 5 = Post-undergraduate degree (66, 6.6%)
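For readers who want the same kind of “Valid Values” listing outside SPSS, a rough equivalent can be sketched in Python (standard library only; the data and value labels here are made up to mirror the gender entry above, and none of this is SPSS itself):

```python
from collections import Counter

# Made-up coded responses (0 = Male, 1 = Female), mirroring the codebook entry.
gender = [0, 1, 1, 0, 1, 0, 0, 1, 1, 1]
labels = {0: "Male", 1: "Female"}

def codebook_entry(values, value_labels):
    """Return (code, label, count, percent) rows, like SPSS's 'Valid Values' table."""
    counts = Counter(values)
    total = len(values)
    return [(code, value_labels[code], n, 100.0 * n / total)
            for code, n in sorted(counts.items())]

for code, label, n, pct in codebook_entry(gender, labels):
    print(f"{code} {label:8s} {n:3d} {pct:5.1f}%")
```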
Coding responses with a fixed number of possible responses is relatively straight-forward. To code missing values,
where a respondent didn’t answer a particular question, choose a value that is impossible from the given range of valid
responses. For a Likert scale, coded as 0 to 4 (0 = Strongly disagree, 4 = Strongly agree), a missing response could be
coded as 9 or 99.
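The reason the missing-value code must be impossible as a real answer becomes obvious the moment any statistic is calculated. A short Python sketch (the Likert data are made up; 9 is the sentinel as above):

```python
MISSING = 9  # sentinel chosen because it cannot occur among the valid codes 0..4

# Made-up Likert responses (0 = Strongly disagree .. 4 = Strongly agree);
# two respondents skipped the question and were coded 9.
raw = [3, 4, 9, 2, 0, 1, 9, 4, 3, 2]

valid = [r for r in raw if r != MISSING]

# Treating 9 as a real value would badly inflate the mean:
naive_mean = sum(raw) / len(raw)      # wrong: includes the 9s
clean_mean = sum(valid) / len(valid)  # right: missing values excluded
print(naive_mean, clean_mean)
```

This is exactly why SPSS must be told which codes are missing (section 4.1.6): otherwise the sentinel values silently distort every statistic.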
4.1.2 Coding Open Text Response Items
When you think of service quality, what is the first word that comes to mind?
___________________________________________________________________________________
These are coded by first making a list of all of the unique responses, and then coding them as normal. Say we had a total
of 5 unique responses as follows:
Word Coded as …
Fast 0
Friendly 1
Competent 2
Cheap 3
Focussed 4
Now we would re-review all of the responses and code them appropriately. So it is a two-pass process to code open text
responses.
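The two-pass process can be sketched in a few lines of Python (the responses are the made-up ones from the table above):

```python
# First pass: collect the unique responses and assign codes in the order seen.
responses = ["Fast", "Friendly", "Fast", "Competent", "Cheap",
             "Focussed", "Friendly", "Fast"]

codes = {}
for word in responses:
    if word not in codes:
        codes[word] = len(codes)   # Fast=0, Friendly=1, Competent=2, ...

# Second pass: re-review every response and record its code.
coded = [codes[word] for word in responses]
print(codes)
print(coded)
```

In practice the first pass is usually done by a human, since “Quick” and “Fast” should probably share one code; the sketch just shows the bookkeeping.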
4.1.3 Coding Multiple Response (MR) Items
Description | Example
Tick all that apply. | How would you like to pay for the service? Tick all that apply: Credit card; Cash; Cheque; Paypal.
Select single option for each of multiple questions. | Thinking of the service that is provided, how would you rate the following (circle your chosen response, 1 = Very Poor … 7 = Very Good): Speed: 1 2 3 4 5 6 7; Value for money: 1 2 3 4 5 6 7; Usefulness: 1 2 3 4 5 6 7.
Enter text with more than one response allowed; answers have no order to them. | What problems do you see with the service? ______________________
Enter text with more than one response allowed; answers have a specific order to them. | What are the top two improvements you would like to see made to the service? 1. ______________ 2. ______________
The initial coding of these involves creating a separate variable for each possible response, and treating these
individually like single response items. Within SPSS, however, each of the above may be handled differently. This will be
dealt with in section 4.1.8 below.
Example | Coded as …
How would you like to pay for the service? Tick all that apply: Credit card; Cash; Cheque; Paypal. | Create a variable for each possible response, so:
  Name | SPSS Variable | Responses | Measure type
  Credit card | payCC | 0 = No; 1 = Yes | Nominal
  Cash | payCash | 0 = No; 1 = Yes | Nominal
  Cheque | payCheque | 0 = No; 1 = Yes | Nominal
  Paypal | payPayPal | 0 = No; 1 = Yes | Nominal
Thinking of the service that is provided, how would you rate the following (circle your chosen response, 1 = Very Poor … 7 = Very Good): Speed; Value for money; Usefulness. | Create a variable for each possible response, so:
  Name | SPSS Variable | Responses | Measure type
  Service Speed | servSpeed | 1 to 7 | Ordinal
  Service Value | servValue | 1 to 7 | Ordinal
  Service Usefulness | servUseful | 1 to 7 | Ordinal
  If appropriate, missing could be coded as 9 for each variable.
What problems do you see with the service? | Review all responses, and create a variable for each of the unique themes presented across all responses. Create sufficient variables to hold respondents’ maximum number of themes/unique-responses … or limit to your opinion of their top three! Now code each theme as the following sample response themes show:
  Name | SPSS Variable | Responses | Measure type
  Too costly | servCostly | 0 = Not present; 1 = Theme present | Nominal
  Too slow | servSlow | 0 = Not present; 1 = Theme present | Nominal
  Not focussed | servNotFocussed | 0 = Not present; 1 = Theme present | Nominal
  It might be appropriate to code the responses on a scale: for example from 0 = not present, to 1 = mildly present, 2 …
What are the top two improvements you would like to see made to the service? | This is coded similarly to the second set of details above for the last open text response; however, cognisance must be paid in the analysis to the fact that the responses have a specific order.
  Name | SPSS Variable | Responses | Measure type
  Improve1 | servImprove1 | Response code | Nominal
  Improve2 | servImprove2 | Response code | Nominal
  This type of question could be coded using the first set of details from the last open text response above, but then the coding would have to indicate both the presence of an improvement theme and its position (first or second). This is awkward.
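The payCC/payCash/… coding in the first row above amounts to building one 0/1 indicator variable per option. A Python sketch with made-up respondents (this is just the bookkeeping, not SPSS itself):

```python
# Each respondent's ticks from the "How would you like to pay?" item.
options = ["Credit card", "Cash", "Cheque", "Paypal"]
answers = [
    {"Credit card", "Paypal"},   # respondent 1 ticked two options
    {"Cash"},                    # respondent 2 ticked one
    set(),                       # respondent 3 ticked nothing
]

# One indicator variable per option, as in the payCC/payCash/... coding above.
indicator_rows = [[1 if opt in ticks else 0 for opt in options]
                  for ticks in answers]
print(indicator_rows)
```

Each inner list is one row of the data file: four nominal 0/1 variables per respondent, exactly as the table prescribes.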
4.1.4 Measure Types: Nominal, Ordinal, Scale
Scale Variables that are scale are numeric, and the difference between values can be determined. There
are two types of scale data, Interval and Ratio, but discussion of this is beyond the scope of this
document.
NOTE: Do not confuse the coded values with the actual values. Coded values will typically be numeric, but this does not
then make the variable a scale variable!! A variable’s “measure type” is associated with the original uncoded data for
that variable.
4.1.5 Entering Data in SPSS: Variable View
The two tabs at the bottom, “Data View” and “Variable View”, give you access to the associated area within SPSS. First
we need to set up variables, so click on the “Variable View” tab to get something like the following view. The window
width has been extended so that all fields are visible, and the image has been reduced in size to make it fit on the page here:
Field Use/Meaning
Name SPSS variable name; Should contain no spaces, and begin with a letter. Can contain letters and numbers.
Type Tells SPSS what type of data you will be entering for this variable. Numeric; Date or String (ie text) are
probably the most common types. You can also set the “Width” and decimal places here.
Width Total number of digits to allow internally for a numeric. Total number of characters for a string/text
variable.
Decimals Out of “Width” how many digits to allow for decimals.
Label User friendly description of the SPSS variable. This will be displayed, rather than the name, for most output.
Values Allowed/valid code values, along with a user friendly description for each.
Missing How missing values will be entered, if any. See the missing values section 4.1.6 below.
Columns Total display width in SPSS data viewer. Has no effect on analysis.
Align Display alignment in SPSS data viewer. Has no effect on analysis.
Measure Scale, Ordinal or Nominal … See section 4.1.4 above.
Role How the variable is to be treated within some analysis. Values possible are Input, Target, Both, None,
Partition or Split. For some analyses a variable is an Input, whereas in others it is a Target meaning it is
what you want other variables in a model to attempt to predict in some way. We’ll be sticking with “Input”
here.
Here’s what the variable view screen looks like after the first six of the multiple response variables from section 4.1.3
above have been entered:
And so on for any other variables. I have only entered some of the MR variables here. Analyze/Reports/Codebook
output for servSpeed and servImprove1, after deselecting the Statistics (these can’t be generated yet, since there is
no data!!), produces:
servSpeed
Standard Attributes: Position 4; Type Numeric; Format F4; Measurement Ordinal; Role Input
Valid Values: 1 = Very poor; 2 = Poor; 3 = Slightly poor; 4 = Neutral; 5 = Slightly good; 6 = Good; 7 = Very good

servImprove1
Standard Attributes: Position 7; Label Service Improvement 1; Format F4; Measurement Nominal; Role Input
Valid Values: 1 = Speed up; 2 = Train staff technically; 3 = Charge less; 4 = Hire more staff; 5 = More flexible service delivery; 6 = Staff more courteous
Once all variables are defined in SPSS then you are ready to enter response data.
4.1.6 Coding Missing Values
Whatever missing values you use must not be possible legitimately as responses to questions‼ So if you had a “tick all
that apply” section, with 12 responses, coded 1, 2, …, 12, then you could NOT use 9 as a missing value! Typically 99
would be used here, but any value that is not possible as part of the main response coding can be used.
There are two missing values shown above, 99 and 88, but for many situations just one will suffice.
There are different types of missing values. There are questions that someone chooses not to respond to, even
though they “should have” from your point of view, eg someone leaves a gender Male/Female box unticked. In
addition there are questions which do not apply, and which are legitimately left empty rather than being “missing”.
For example, if someone ticks “yes” in question 1 in the following survey extract, then they will legitimately skip
question 2, ie it is not “missing” in the same sense that an empty question 1 would be:
All of this should ideally be coded into SPSS. Sometimes analysing the missing data can be as informative as the main
data‼
Where you have an item, or block of items (ie multiple response, such as “tick all that apply”) that someone should have
answered but they didn’t then I would suggest coding this as a 9/99/999/etc missing value. I say “I would suggest”
because coding is a subjective task, and can be done other ways, depending on your planned usage of the data. In fact
the planned usage is really key to both creating the survey itself, and coding and entering the data. The planned usage
should come first, and the survey design, coding and data analysis come to match the need of the task at hand!
If you have an item, or block of items, that the interviewee did not need to complete for some reason, such as in the
iPhone example above, then this is not “missing” in the sense that they didn’t do something that they should have! It is
missing in a way that you would have expected. I would suggest coding this differently to the scenario in the last
paragraph, and my suggestion would be to leave it empty, which is called “system missing”; or you could code it as
another missing value, such as 8/88/888/etc. Whichever way you do it, ideally there should be some way to differentiate
between the types of “missing”, so that you can potentially use this information later.
Finally … if you have a multiple response, eg “tick all that apply”, and someone ticks one item, then you do not code the
other items as “missing”. On a “tick all that apply” question, if they tick one box then they mean “yes” for this one, so
not ticking the others means “no” for those, and you would code it as such. Another scenario would be where you
present someone with 15 options and want their top 3. I would suggest NOT coding the unmarked 12 as “missing”
using 9/99/999/etc., but would probably again use system missing, ie blank, as they are legitimately “missing”. Or else
I would create an additional missing category, eg 7/77/777/etc., and use that.
The above are my broad suggestions, which cover most typical situations, but sometimes the above might not suit, and
you’ll have to think through what it is you are trying to achieve and what coding system might facilitate that. You might
start coding things one way, but end up later recoding things differently for some reason that only became apparent
when you had the actual data to work with! This is to be avoided, as it is costly from a time point of view, but it does
happen‼ A good pilot survey will often help avoid this, and save time later.
SPSS has menu options to help with mass changes, such as the Transform/Recode … menu options.
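Recoding in SPSS is menu-driven, but the underlying idea is easy to sketch outside it. Below is a minimal Python/pandas sketch, with entirely hypothetical data and the codes 99 (should have answered but didn’t) and 88 (did not apply) as the two user-missing values, showing how both can be excluded from numeric analysis while the reason for the missingness is kept:

```python
import pandas as pd
import numpy as np

# Hypothetical survey column using 99 = "should have answered but didn't"
# and 88 = "question did not apply" as user-missing codes.
servSpeed = pd.Series([5, 7, 99, 3, 88, 6])

# Keep the distinction: record WHY each value is missing before discarding
# the codes for analysis purposes.
missing_type = servSpeed.map({99: "unanswered", 88: "not applicable"})

# For numeric analysis, treat both codes as missing (like SPSS user-missing).
clean = servSpeed.replace({99: np.nan, 88: np.nan})

print(clean.mean())                        # mean of the four valid responses
print(missing_type.dropna().tolist())      # the two kinds of "missing"
```

The point is the same as in SPSS: the user-missing codes are excluded from the statistics, but the information about why a value is missing is not thrown away.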
You now enter each respondent’s data across a row. A row is called a “case” in SPSS. Note that SPSS will allow you
to type invalid values, such as codes outside the range of specified “values” for a variable!! Spotting any such
typos/errors is important to do before a full analysis of the data. The old adage of “Garbage in, garbage out”
applies!! The Analyze/Descriptive Statistics/Frequencies command is one way to try to find any errant values, as it
will list all values with their corresponding frequency, so it is easy to spot a value that shouldn’t be there!
For large data sets a lot of time may be spent on data cleaning: looking for any values that stand out in any unusual
way. This can effectively be a whole analysis in its own right!! It is clearly VERY important, since mistakes in the data
entered might completely change the results of any analysis … In fact this is another way to track down possibly
errant values: look for individual values that strongly influence the results of any analysis, such as outliers.
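As a sketch of the frequency-based check described above, here is the same idea in Python/pandas on hypothetical data (a 1–7 rating column with a typo slipped in):

```python
import pandas as pd

# Hypothetical 1-7 rating column with a typo (77) slipped in.
ratings = pd.Series([4, 5, 77, 6, 4, 7, 1])

# Equivalent of the Frequencies check: list every value with its
# frequency so out-of-range codes stand out.
print(ratings.value_counts().sort_index())

# Or flag anything outside the documented 1-7 coding directly.
errant = ratings[~ratings.between(1, 7)]
print(errant)   # the offending value(s), with their row positions
```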
The “Variable View + MRSs – Initial” and “Variable View + MRSs – Final” datasets, complete with data are available from
the web site at bbm.colmmcguinness.org/live/AdvancedStats.html. The initial version just has variables and data, with
no MR sets/variables created. The final version has the MR sets/variables created.
These are:

Analyze/Multiple Response/Define Variable Sets:
This menu path allows you to define “variable sets”. This essentially just gives a group name to the variables
included in the set. You can then access the Analyze/Multiple Response/Frequencies and
Analyze/Multiple Response/Crosstabs commands for this new group/set name. All output will contain the set
variables in a block. For cross comparison of responses from a MR question the blocking makes life easier.
Variable sets defined in this way are not available for general commands, such as Chartbuilder.

Data/Define Multiple Response Sets:
Sets defined here do something completely different to the variable sets just described! Here the command
effectively summarises the original set variables into a new variable, which is the MR set name. The entries in this
variable will be summaries of the individual variables included in the set!! MR sets are available for some general
commands, such as Chartbuilder. This command has two somewhat different ways of functioning, of which details
will be given below.
The variables from multline through to ebill specify whether the respondent has the associated service from the
Telecoms company. It would be useful to see output where these are grouped, so we access Analyze/Multiple
Response/Define Variable Sets to get (after some typing + clicking!):
3. MR variable set name, and label. Note the “Note” here, explaining that these sets are only available for two
commands.
Having created the MRServiceSet variable (which SPSS will stick a $ in front of, just to differentiate it from a “real”
variable), now access the Analyze/Multiple Response/Frequencies command. Select your $MRServiceSet variable set,
and click OK to get:
[Output: a Case Summary table, and a $MRServiceSet Frequencies table with columns for Responses (N, Percent) and Percent of Cases.]
You now have a “nice” grouped output summarising the frequencies, etc., for the individual variables in the set
(Multiple lines, Voice mail, and so on).
It is clear how only the Yes’s have been counted in the MRServiceSet. The 3740 is the total number of all Yes’s, across all
services. The Responses Percent is the percentage of Yes’s out of this total number of Yes’s. The “Percent of cases” is
similar to the Percent column for the individual variable output, but now with all of those with no Yes’s for any service
eliminated‼ Hence the Yes percentages are higher than those in the individual variable output. If you check (give it a
go!) you will find that there are 111 cases with no service, ie no Yes’s.
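To make the arithmetic concrete, here is a toy Python/pandas sketch of how a multiple dichotomy summary counts the Yes’s. The data here are hypothetical (three services, four cases), not the telco file:

```python
import pandas as pd

# Hypothetical 0/1 ("No"/"Yes") service columns, as in a multiple
# dichotomy set. (The real telco file has 1000 cases; this is a toy.)
df = pd.DataFrame({
    "multline": [1, 0, 1, 0],
    "voice":    [1, 1, 0, 0],
    "ebill":    [0, 1, 0, 0],
})

yes_counts = df.sum()                        # Yes's per service
total_yes = yes_counts.sum()                 # all Yes's across services
cases_with_any = (df.sum(axis=1) > 0).sum()  # cases with at least one Yes

pct_of_responses = 100 * yes_counts / total_yes     # "Responses Percent"
pct_of_cases = 100 * yes_counts / cases_with_any    # "Percent of Cases"
print(pct_of_responses)
print(pct_of_cases)
```

Because the no-service case is excluded from the denominator of the second calculation, the “Percent of Cases” figures are higher, exactly as described in the text.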
The overall dialog here will create a new variable (MRYesSet) that will
contain only summary/count entries for values matching the
“counted value”. These values can be either labelled by their original
variable name (the default, as selected here) or using the labels of the
values selected in the “counted value” (above), which would be “Yes”
here.
Click the “Add” button to create the MR Set, and then OK to complete the task.
Next use Chartbuilder to create a pie chart of $MRYesSet, and after some label additions (using the chart editor) you
should get:
[Codebook output: $MRYesSet — Label “MR Set with Yes’s”; Type Multiple Dichotomy Set. The frequency output begins: callid, Caller ID, 481, 48.1%.]
Can you see how the MR Set has “collapsed”/summarised all of the Yes’s down into this new variable called $MRYesSet?
Labels are the original variable names (which is what we selected). In fact it isn’t “collapsed” but is now a new variable
with 3740 entries, 481 of which are “Caller ID”, 485 of which are “Call waiting”, etc. This is clearly quite a different
summary compared to what happened in the last section!
The “Counted Value” can be any valid value, and only these matching values will be included. The set is dichotomous in
the sense that the counted value items are included and everything else isn’t!! So this dialog can be used to summarise
a set of variables several times, once for each of their individual categories. Say a set of MR variables had responses
from 1 (Very Poor) through to 7 (Very Good); then you could create 7 separate MR Sets, one matching each category
response. How this is used really comes down to what you want to produce for your reporting, or what particular idea
you are trying to investigate. As mentioned before, always start with the goals/ideas and then fit SPSS to that! Don’t
start with some fancy output from SPSS and try and force it into your work!!
Next, on to the second type of functionality available from Data/Define Multiple Response Sets. But first we need
another data set to make the new type of summary meaningful. Open the “Variable View + MRSs – Initial” data set
(available from web site???) …
Selecting “Categories” here tells SPSS to count the categories across all included variables. Unlike dichotomies, which
summarise/work within a variable, categories summarise/work across variables. It is thus useful when there is an
ordering to the separate but related responses: here servImprove1 is the respondent’s top suggestion, and
servImprove2 is their second suggestion.
The other textboxes and buttons work as before. Click Add to create the set, and then OK
to exit the dialog.
[Codebook output: $MRImprove — Type Multiple Category Set. The Multiple Response Categories frequency rows include: 1 Speed up 54 21.6%; 2 Train staff technically 92 36.8%; …; 99 23 9.2%.]
And for comparison, to work out what has happened here, here is the output from the Analyze/Descriptive
Statistics/Frequencies command for the two servImprove variables (Service Improvement 1 and Service Improvement 2,
each with Frequency, Percent, Valid Percent and Cumulative Percent columns).
Taking “Speed up” as an example, there is a total, across both variables, of 31 + 23 = 54 respondents who included this
response, and this gives the first entry for the $MRImprove variable. The $MRImprove “Percent” column seems odd to
me, as the percentages are out of the number of original cases (ie 250 here). They add up to 200%, since each case
contributed two responses.
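A toy sketch of the category-set arithmetic (hypothetical data, in Python/pandas rather than SPSS) shows why the Percent column sums to 200% when every case gives two responses:

```python
import pandas as pd

# Toy version of the category MR set: each case gives a top and a second
# suggestion; the set tallies categories ACROSS both variables.
df = pd.DataFrame({
    "servImprove1": [1, 2, 1, 3],
    "servImprove2": [2, 1, 3, 2],
})

n_cases = len(df)
combined = pd.concat([df["servImprove1"], df["servImprove2"]])
counts = combined.value_counts().sort_index()

pct = 100 * counts / n_cases   # SPSS-style "Percent": out of cases,
print(counts)                  # so the column sums to 200% when every
print(pct.sum())               # case contributed two responses
```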
Using Chartbuilder to create a pie of the $MRImprove variable, after some labelling changes gives:
Hopefully you can see that this is quite a different type of summary compared to the way the dichotomy option worked.
It is worth repeating the point from above: How this is used really comes down to what you want to produce for your
reporting, or what particular idea you are trying to investigate. As mentioned before, always start with the goals/ideas
and then fit SPSS to that! Don’t start with some fancy output from SPSS and try and force it into your work!!
While lots of analysis and charts might all be very interesting, it is crucial to remain focussed on the research
agenda/hypotheses, unless this is explicitly an exploratory data analysis.
For an exploratory data analysis (EDA) almost anything goes, and every tool might be used to gain greatest insight into
any information and patterns within the data.
For any data analysis it is an excellent idea to first investigate what others have done in a similar context. Try not to
waste time reinventing the wheel!! There are almost always going to be existing theses, journal articles, books, etc. that
will be useful for ideas and methods.
Action — How
Summarize/describe data — Frequencies, eg simple or grouped frequency distributions. Charts, eg a pie chart to see
proportions for ordinal or nominal data, or a histogram to see the distribution for scale data.
What information might address the need? — Exploratory research versus a specific hypothesis (or two/three!)?
Inductive versus deductive research.
Collect data
Check + explore data — Checking the data can involve double entry, for example. Also checking that all values
entered are “legal”, conforming to the codebook. Also checking for outlier, missing or unusual values. Checks can be
graphical and/or tabular (or whatever you can think of!). Work here can save you a LOT of time later! “A stitch in
time … !” Exploring the data, which can be an end in itself, means using tables, charts and descriptive statistics to
see patterns or information within the data. Possibly with a view to constructing and testing a more formal
hypothesis with a later experiment: an inductive approach.
Any pre-statistical test checks? — Some statistical tests have assumptions, such as “data are normally distributed” or
“all cells have frequency of 5 or higher”. These must be checked and reported on.
Any post-statistical test checks? — Some statistical tests have assumptions that can only be checked after the
statistical model is created, such as “residuals should be normally distributed” or “residuals should have constant
variance”. These must be checked and reported on.
The plan is to follow through an actual example, complete with tests and SPSS output.
Might be better to follow through an actual questionnaire example, rather than the statistical process example … or is
this section needed at all?? Could be overkill …
???
Use this command to select the cases that will be included for all further analysis/output, until a subsequent Select
Cases is executed.
The first four Select options perform the obvious/stated task. The “Use filter variable” option uses a variable whose
values should be 0’s and 1’s only. The 0’s are not selected, and the 1’s are selected. This allows you to create
complicated selections based on variables calculated using the Transform/Compute Variable command.
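The filter-variable idea can be sketched in Python/pandas — the variable names and cut-offs below are entirely hypothetical:

```python
import pandas as pd

# Sketch of the "Use filter variable" idea: a 0/1 variable computed
# from other columns, then used to select cases.
df = pd.DataFrame({"age": [25, 41, 33, 58], "income": [30, 72, 55, 64]})

# Equivalent of Transform/Compute Variable: 1 = selected, 0 = not.
df["filter_"] = ((df["age"] > 30) & (df["income"] > 50)).astype(int)

selected = df[df["filter_"] == 1]
print(selected)   # only the cases with filter_ == 1 remain
```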
If you are using the Select Cases command then it is important to keep track for each analysis/chart output as to what is
currently included/excluded!!
With a Split File command you can make all subsequent commands apply to sub-groupings within your data. So a single
Analyze/Descriptive Statistics/Frequencies command will in fact be applied to each sub-group based on the Split File
details currently active. As with the Select Cases command it is important to keep track for each analysis/chart output as
to what is the current split, if any!
The Compare Groups option applies subsequent commands to each sub-grouping within the active split file. For
example, applying the following Split File to the telco.sav data, and following this by creating a pie chart of “level of
education”, produces four pie charts: one for the level of education for Male, Unmarried; another for Male, Married;
another for Female, Unmarried; and a fourth for Female, Married.
This is clearly very useful for drilling into sub-group information, relevant to your analysis.
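For comparison, Split File’s Compare Groups behaviour corresponds to a group-by operation. A Python/pandas sketch with hypothetical data:

```python
import pandas as pd

# Split File's Compare Groups is essentially a group-by: one command,
# repeated per sub-group. Hypothetical gender x marital split:
df = pd.DataFrame({
    "gender":  ["M", "M", "F", "F", "M", "F"],
    "married": [0, 1, 0, 1, 1, 0],
    "educ":    [2, 3, 4, 2, 5, 3],
})

# Mean education level for each of the four sub-groups.
means = df.groupby(["gender", "married"])["educ"].mean()
print(means)
```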
With the Compare Groups option active, the following mean values are generated from the Analyze/Descriptive
Statistics/Descriptives command:
Whereas with the Organise output by groups option active, the same Descriptives command produces:
Frequency tables are often a very useful way to get a summary for nominal and ordinal data, with the
corresponding charts being bar and pie charts. For scale data it is usually best to group the data, eg using
Transform/Visual Binning, before creating a frequency table, with a histogram being the visual equivalent chart.
These charts are all available as options within the Frequencies dialog.
Tables can be produced with more levels of sub-grouping detail using the Analyze/Descriptive Statistics/Explore
command, or by using the Data/Select Cases command.
For nominal and ordinal data Analyze/Descriptive Statistics/Crosstabs can produce contingency tables of counts,
and carry out subsequent tests for independence (ie chi-squared tests). Even more sub-grouping details are
possible here using the “Layers” option.
Using the telco.sav data, here’s a Frequencies summary of the levels of education of the cases, with a pie chart
(from the Frequencies/Charts button) thrown in:
[Output: Statistics — Level of education: N Valid 1000, Missing 0; followed by the Level of education frequency table.]
Next let’s see how level of education breaks down by gender or retired status, using the Analyze/Descriptive
Statistics/Crosstabs command as follows:
[Partial crosstab output. Level of education by Gender (counts): Post-undergraduate degree — Male 33, Female 33, Total 66; Total — Male 483, Female 517, Total 1000. Level of education by Retired: Post-undergraduate degree — No 64, Yes 2, Total 66; Total — No 953, Yes 47, Total 1000.]
We are here getting a count for each level of education, with details for those that are Male, or Female and
separately details for those that are Retired or not.
Say we wanted to get level of education information, by gender AND marital status, then we use the Layers box:
[Partial layered crosstab output, eg for one marital-status layer: Post-undergraduate degree — Male 9, Female 21, Total 30.]
[And in the Total layer: Post-undergraduate degree — Male 33, Female 33, Total 66.]
This allows you to drill right down into the data, looking for any interesting “interactions” or patterns. As always
don’t get lost in the analyses … Start with your questions, and then produce analyses that attempt to address
these questions!
You could add more layers, to get an even finer level of detail. For example if I added Retired as Layer 2, I’d get
output as follows (only partial output shown here, eg Post-undergraduate degree — Male 9, Female 21, Total 30 in
one sub-group, and Male 22, Female 12, Total 34 in another):
[And among the sub-totals: Post-undergraduate degree — Male 31, Female 33, Total 64.]
There is loads more you can do within the various Analyze/Descriptive Statistics menu options. Detailing them all
would take too long … it is much better to open the telco.sav (or other) sample file for yourself and try out some
of the commands. See what insights you can gain into profiling who was surveyed, and in particular who churns²
and who doesn’t!
More details will be given in the sections that follow, but starting with the stats, rather than the menu options.
8.2 Statistics
A statistic is just any result calculated from a set of data. The maximum, minimum and total are all examples of
statistics. So “statistics” need not be all that fancy or technically complex … although they can be!!
In order to summarise data two key types of statistics are required: A measure of middle, and a measure of
spread. Either one of these alone is really quite a poor summary as the following examples will demonstrate.
There are many further statistics possible, but these two types are fundamental.
Say I take a sample of 100 cartons each from two breakfast cereal production lines. I find the mean weights of
cereal to be identical at 499.5g. Does this tell me that the production lines are operating identically? Well not if
the following are the associated distributions for the sampled data:
² Churning is the term used to describe customers switching service providers.
While these two distributions of the 100 sample values have the same middle/mean at 499.5g, it is clear that
the two distributions are NOT the same! And consequently the two production lines are not operating the same.
The x-axis scales have been set to be the same on both charts. Here this highlights that the production line 2
data are far more spread out. There is a lot more variability between cartons of cereal on production line 2
compared to production line 1. Thus a measure of spread is needed to reasonably summarise the data.
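The point can be checked numerically. In this hypothetical Python sketch, two small samples share a mean of 499.5g yet have very different spreads, so the mean alone hides the difference:

```python
from statistics import mean, pstdev

# Two hypothetical production-line samples with the same mean but
# clearly different spread: the mean alone hides the difference.
line1 = [499.0, 499.5, 500.0, 499.5, 499.5]
line2 = [490.0, 509.0, 499.5, 494.0, 505.0]

print(mean(line1), mean(line2))      # identical centres: 499.5 and 499.5
print(pstdev(line1), pstdev(line2))  # very different spreads
```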
Take a similar situation to the above example. We take a sample of 100 cartons and only measure
spread/variability using the standard deviation (or any measure of spread), without a measure of middle/centre.
Say we find that the standard deviation is 4.598g. Does this (alone) mean that the production lines are operating
similarly? Look again at these distributions:
This time, while these two distributions of the 100 sample values have the same spread as measured by the
standard deviation of 4.598g, it is clear that the two distributions are NOT the same! And consequently the two
production lines are not operating the same. The x-axis scales have been set to be the same on both charts.
Here this highlights that production line 4 is under-filling cartons (assuming the average should be 500g).
Thus a measure of centre/middle is needed to reasonably summarise the data!
The mean is probably the single most commonly used statistic. It is a key statistic when dealing with normally
distributed data as the population mean and population standard deviation fully determine the associated normal
distribution. The mean gives the centre of a normal distribution, and the standard deviation relates to the spread of the
distribution.
You can calculate the mean/median/mode for a variable from various menu options under the Analyze/Descriptive
Statistics menu path.
Since the mean is “biased by extreme values” another mean you will see at times in SPSS is the “5% Trimmed Mean”.
This is the mean that results after first eliminating the top and bottom 2.5% of data. If this differs from the “mean” then
it is indicating that there are some extreme and biasing values present. Other trimmed means are possible such as a
10% trimmed mean, but I haven’t seen this available in SPSS.
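The calculation behind a trimmed mean is simple. Here is a hypothetical Python sketch of the idea (the function and data are made up for illustration, not taken from SPSS):

```python
def trimmed_mean(values, prop=0.025):
    """Mean after dropping the top and bottom `prop` fraction of the
    sorted data -- the idea behind SPSS's 5% Trimmed Mean (2.5% cut
    from each end)."""
    data = sorted(values)
    k = int(len(data) * prop)            # number to cut from EACH end
    kept = data[k:len(data) - k] if k else data
    return sum(kept) / len(kept)

# An extreme value pulls the plain mean up; the trimmed mean resists it.
sample = list(range(1, 40)) + [1000]     # 40 values, one wild outlier
plain = sum(sample) / len(sample)
trimmed = trimmed_mean(sample, 0.025)    # cuts 1 value from each end
print(plain, trimmed)                    # 44.5 versus 20.5
```

When the plain mean and the trimmed mean differ this much, it is a strong hint that extreme values are present.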
The mean would be commonly associated with parametric statistics, whereas the median is more common when non-
parametric statistics are appropriate. The simplest example would be using the mean with (approximately) normally
distributed data, but using the median if the data are clearly not normally distributed. Note that this is not saying that
the mean isn’t useable with non-normal data, just that it is common with normal data, and not quite as common with
non-normal data.
Quartile deviation (QD) — Half the width of the central 50% of the data. Not biased by extreme values, but ignores
all but the Q1 and Q3 quartile values. Associated measure of middle: Median. Data: Scale (doable with raw scale
data, but might make more sense with data in groups/bins) or Ordinal.
Percentile Ranges — Basically a more flexible/precise version of the QD, since any range of values can be measured,
eg a 5 to 95 percentile range would be calculated from P95 − P5, which measures the central width of 90% of the
data. Associated measure of middle: Median. Data: As for QD.
The variance is simply the standard deviation squared. When dealing with statistics from a mathematical point of view
it is often the variance that naturally arises in calculations, and not the SD. A population variance would often be
written as σ², directly indicating it as the square of σ, the population SD.
It has been established theoretically that if your population data³ has mean μ and standard deviation σ, then the
standard deviation of means taken from repeated samples of size n will be σ/√n. This statistic is called the “standard
error of the mean”. It is a VERY important statistic, as it represents a measure of how much we might expect means to
vary by, between samples. We might only have one sample, but the theory allows us to know what could have
happened had we had (infinitely) many samples!!
Roughly speaking what we are talking about here is “imagining” what would happen if we could take multiple random
samples from a population: The theory work has been done, which we won’t be deriving here, but to give you at least a
clear idea, consider the following diagram of a population4 with some samples also depicted:
³ Typically Greek letters are used to represent statistics for a population, and the likes of x̄ and s are used to represent
statistics for a sample.
⁴ This is a fairly crude representation! The way the samples are depicted they look like cluster samples, and not random
samples!! For the stats to work they MUST be random samples. If you must deviate from a random sample then this
would lower the robustness of your research, and this must all be reported on!
[Diagram: an overall population with four samples drawn from it, with means x̄₁, x̄₂, x̄₃ and x̄₄. Dashed lines are a (crude!) depiction of the distances of each sample mean from the mean of means (which will be the population mean).]
Although only 4 samples are shown, you can imagine the setup if there were hundreds or thousands, etc. The “standard
error (SE) of the mean” is a measure of how much these means all vary from each other, or put more accurately how
they vary from the mean of the means!! This is the “natural variation” (sampling error/variation) that we can expect to
occur if we did take more than one sample. Luckily for us we will only need one sample to know the SE of the mean, at
least approximately. If we know σ then we know that the SE of the mean is σ/√n, but we generally won’t know σ, so we
will approximate the SE of the mean with s/√n. This approximation has consequences, which will be mentioned later.
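Using the numbers from the worked example below (s = 3.017, n = 30), the approximation is easy to verify in Python:

```python
from math import sqrt

# The single-mean example's figures: sample SD s = 3.017, sample size n = 30.
s, n = 3.017, 30
se = s / sqrt(n)          # approximate standard error of the mean
print(round(se, 2))       # 0.55, matching the text
```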
The following diagram depicts what we know, following a sample, and is a bit more technically accurate than the above
thought experiment!
[Diagram: the sampling distribution of the sample mean, N(μ, σ²/n).]
An absolutely key point behind many (all?) inferential statistics is the following:
We know that sample results/stats will vary from each other, and we will often (in theory) know some measure of this
sampling error/variation (eg the standard error of the mean above). So if we take a new sample and find a substantially
bigger variation than we would expect according to the sampling error/variation, then we know we have evidence for a
“statistically significant effect”. A great many statistical tests are based on this core idea.
Say our default theory (known as the null hypothesis in statistics) is that the mean for a certain population should be
100g (say). We take a sample, and find the mean is 98g for the sample, with a standard deviation of 3.017g. Graphically
what this example would look like is:
⁵ The example details here are taken from the “Single Mean: Mean that differs from given value” section.
[Chart annotation: the sampling distribution under the default theory is N(100, 3.017²/30). It is not fully explained here, but our sample mean of 98g ends up far out here (see below), for the default theory of 100g, and the value of s/√n = 3.017/√30, which is 0.55 (2dp).]
Our null hypothesis was a population mean of 100g, but the evidence (our sample) does not support this! The obtained
sample mean is actually highly improbable (which you will see when you perform the T Test later in section 9.3.3
below!). In this situation the sample mean is just too different from the expected mean; It is at the outer limits of the
normal/expected variation of means, so we would reject the null hypothesis! The evidence does not support a
population mean of 100g. We have found a statistically significant difference between the mean and the expected value
of 100g.
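The strength of that evidence can be sketched numerically: the one-sample t statistic measures how many standard errors the sample mean sits from the null value. A Python sketch using the figures above (this previews the arithmetic behind the T Test mentioned, not the full SPSS procedure):

```python
from math import sqrt

# One-sample t statistic for the example above:
# null mean 100g, sample mean 98g, s = 3.017, n = 30.
mu0, xbar, s, n = 100.0, 98.0, 3.017, 30
t = (xbar - mu0) / (s / sqrt(n))
print(round(t, 2))   # -3.63: the sample mean sits several standard
                     # errors out in the tail of the expected variation
```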
8.2.4 Meaningless Statistics: Just because you can, doesn’t mean you should!
Even when it is possible to calculate a particular statistic, a key question is “does it make sense?”!! For example with the
telco.sav sample data file, it is possible to calculate the mean of the Geographic indicator variable, as follows:
Analyze/Descriptive Statistics/Descriptives:
Producing output:
Descriptive Statistics
But does a mean geographic indicator value of 2.02 mean anything?! Probably not, I would expect, in this context.
8.3 Charts
Using charting to get an understanding of your data is generally very important. While it is easy to conduct a statistical
test to compare means (say), this test alone is not giving you a full picture of your data. Any statistically significant
difference detected is still (to some degree) just encouragement to go look at your data to see why there are
differences, and where there are differences! Charting is very important.
If you have set up your variable definitions fully then you can just click OK here. If you haven’t, then why haven’t you!?!
You know what to do … ! ;-)
You select the type of chart you want on the left. Here we’ll create a bar chart of level of educational information, so
you click on the bar chart you want (the top left one) and drag it into the preview area to get:
The “Element Properties” dialog appears automatically (on the left in the example above). This is a floating dialog, so
can be moved independently of the main dialog. If you close this you can easily retrieve it by clicking on the “Element
Properties …” button in the main dialog.
Next, drag the “Level of education” variable onto the “X-Axis?” box in the preview area. You’ll get an initial (and small)
preview of your bar chart. You can click on different elements here, and the “Element Properties” dialog will change to
show you options that you can apply to the selected element. You could change the type of y-axis from “Count” to
“Percentage”, for example. Make whatever changes you want, and then click OK. If you try to create a chart and the OK
button is greyed out, it means that you haven’t fully defined the chart. Having selected Percentage for the Y axis, and
then clicking OK, produces:
Say I wanted to break this chart down into male/female for comparison purposes then I could start as before,
but before clicking OK, click on the “Groups/Points ID” tab. For example:
8.3.2 Display or Comparison of Proportions from Counts/Frequencies: Pie charts, Percentage bar charts
The section title says it all here … these are the appropriate charts to use when trying to investigate, display or compare
proportions derived from ordinal or nominal data. For scale data it generally only makes sense after binning the data.
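The binning step can be sketched in Python/pandas — the income values and cut points below are hypothetical:

```python
import pandas as pd

# For scale data, bin first (cf. Transform/Visual Binning), then chart
# the proportions. Hypothetical income values binned into three groups:
income = pd.Series([12, 25, 38, 44, 57, 63, 71, 89])
bins = pd.cut(income, bins=[0, 30, 60, 100],
              labels=["low", "middle", "high"])

bin_counts = bins.value_counts().sort_index()
print(bin_counts)   # counts per group, ready for a pie or percentage bar
```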
8.3.3 Display or comparison of (Scale) Values: Histogram, Line diagram, Error Bars
Histogram and line diagram are fairly standard diagrams, so are left for you to investigate for yourself. Error bars might
be new, so …
Error bars are a way to display confidence interval (and other) data for sub-groups directly on a chart for overall
comparison. Say we wanted to graphically compare the mean “Household Income” by education level. We could just
have created a scatter/dot plot with level of education as the x-axis categories, and mean household income for y-axis
values:
When you initially drag the “Household Income” variable to the y-axis it would show
household incomes for all cases if you were to proceed to the output now.
In order to summarise down to just the mean within each educational level you must change
the “Statistic” from “Value” to “Mean” in the Element Properties window, and then click the
Apply button.
Interesting as this chart is, as was mentioned in section 8.2 above, it is more informative to include a measure of spread,
in addition to a measure of centre. We can do this in a number of ways, but a common approach is to include a
confidence interval. You can actually do this from the “Element Properties” window of the scatter/dot plot above!
However (for no particular reason) we will use the “Simple Error Bar” template from the Bar chart gallery as follows:
It automatically selects the “Display error bars” option, and defaults to a 95%
confidence interval.
This considerably changes how one might interpret the chart!! For example, the first chart seems to suggest a
considerable difference between “Did not complete high school” and “Post-undergraduate degree” groups, whereas
including the spread shows that there is a lot of variation amongst the latter group, to the extent that it is no longer
“obvious” that these groups differ significantly. We could now perform a T Test to check this statistically, which will be
left for you to check for yourself after you’ve completed section 9.3.4 below6!
Working with the chart editor is mostly a matter of just working out where the option/button you want is, so details are
left for your own trial and error!
6 You will find that the difference is in fact statistically significant, with a p-value (significance level) of 0.026. The confidence interval for the difference is [4.6, 68.7] (in thousands).
There are two broad types of inferential statistics: Parametric and Non-parametric statistics. Many statistical tests
make assumptions of various types. It is VERY important to check that any required assumptions are actually met,
otherwise any statistics and conclusions could be invalid/wrong! One common assumption is that the data or some
statistic will follow a particular distribution. Where a specific distribution is an assumption, then you are likely dealing
with parametric statistics. Part of your analysis work is in estimating the parameters of the relevant distribution. Often
this is fine, and the central limit theorem certainly helps when it comes to tests on means! However, sometimes you will
find that a distributional assumption is not valid, and then you’re stuck!
When a distributional assumption is not met, an alternative is the non-parametric approach. Non-parametric statistics
can also have distributional, and other, assumptions, but they are typically less restrictive than those of the
corresponding parametric statistical methods. There is typically some penalty for this, such as the test being less
powerful7.
To further complicate matters, some parametric tests have assumptions which are technically required, but which can
sometimes be relaxed under the “right” conditions. When this is possible the statistical test is said to be robust to
violations of that assumption.
This document mainly deals with parametric statistics, and doesn’t dwell on assumptions! It is worth reiterating from
the preface:
Where the results of a statistical analysis will possibly be of some critical importance, then I’d suggest getting a
professional statistician involved!! This document is only an introduction to SPSS and statistics!!
So what?!
Well, the SD is now an estimate of how this statistic randomly varies between samples. One could call this “natural”
variation, as it comes about as a natural result of taking random samples from the population.
7 This is “power” in the statistical sense: The ability to detect an actual difference.
8 Typically populations are assumed to be infinite, or at least much, much (10 times or more) bigger than our sample. When sampling from finite/“small” populations this needs to be taken into account using statistics that have a “finite population correction”. This is beyond the scope of this document.
9 Assuming that the population has variety present!
Say we have been contracted to manage a marketing campaign. Part of the contract is that we must produce statistical
evidence as to whether the campaign had an actual impact or not.
We could take a random sample from the target population before the campaign and measure (in some way) their level
of knowledge of the product to be marketed. We then carry out the campaign, and afterwards we take a second
random sample from the population and again measure level of knowledge.
There will likely be a difference between the two measurements, the before and the after. What we now really want to
know is: is this difference so small that it likely has nothing to do with the campaign, and is just part of sampling
variation (ie “natural”/inherent variation)? Or is the difference sufficiently big that we can be sure it is more than can be
accounted for by random sampling alone? If it is, then we can assert that the campaign has had a definite effect. And
this is what hypothesis testing does!
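The before/after comparison above can be sketched numerically. The following is a rough illustration in Python rather than SPSS; the awareness scores are invented, and the p-value uses a normal approximation to the t distribution (reasonable for larger samples):

```python
# Two independent random samples: awareness scores before and after the
# campaign (invented data for illustration).
import math
import statistics

before = [4, 5, 6, 5, 4, 6, 5, 5, 4, 6, 5, 4, 6, 5, 5]
after  = [6, 7, 6, 8, 7, 6, 7, 8, 6, 7, 7, 8, 6, 7, 7]

m1, m2 = statistics.mean(before), statistics.mean(after)
v1, v2 = statistics.variance(before), statistics.variance(after)
n1, n2 = len(before), len(after)

# Welch t statistic: difference in means over its standard error
se = math.sqrt(v1 / n1 + v2 / n2)
t = (m2 - m1) / se

# Rough two-sided p-value via the standard normal CDF
p = 2 * (1 - 0.5 * (1 + math.erf(abs(t) / math.sqrt(2))))

print(round(t, 2), p < 0.05)
```

A tiny p-value here would say the observed difference is bigger than random sampling alone can plausibly account for, which is exactly the question posed above.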
The distribution of the T1, T2, ..., Tk is known as the sampling distribution, and the “natural” variation is known as
sampling error10. It is the distribution that results for a statistic from repeated random samples from a population. Using
hypothesis testing, as in the marketing example above, we can determine if the population has changed or not, for
example.
In addition to being able to conduct hypothesis testing as outlined above, knowledge of the sampling distribution for
our statistic (whatever it is!) will enable us to make inferences from our sample about the overall population. This,
“inferential statistics”, is VERY powerful!! If this wasn’t possible then a lot of medicine, for example, just wouldn’t be
possible ... Knowing that a drug worked to a certain level for any sampled group would tell us nothing about the rest of
us!! Luckily this is not so, and knowledge from a sample can be used to describe a range of likely effects on a wider
population. Similarly in marketing, for example, knowledge of the preferences of a sampled group can be used to infer
details of the likely preferences of a larger population. And so on for many areas of life!
To understand hypothesis testing let us consider a criminal trial. The default position is that the accused is innocent, and
the alternative is that they are not innocent (stating the obvious!). If someone is innocent then there are a range of
things, pieces of evidence, that we might expect to find, such as a good alibi. We have various actual pieces of evidence,
so we can see how many of these actual pieces of evidence match what we would expect given innocence. The extent
to which the pieces match up, expressed in probability terms, is called the p-value. So the p-value is the probability of
the evidence given innocence; more generally, it is the probability of obtaining a particular statistic (eg your particular
sample mean) from a certain default distribution of such statistics (eg the distribution of sample means if the default is
in fact true).
If lots of actual evidence matches up with what we would expect given innocence, then we reject the guilty accusation,
and accept innocence11. Statistically this corresponds to a “large” p-value, eg p=0.53, or in general a p-value greater
than our chosen significance level; typically p>0.05, for example, results in the alternative not being accepted.
If little or none of the actual evidence matches up under the assumption of innocence, then we reject innocence, and
accept guilt12. Statistically this corresponds to a p-value less than our significance level, for example p<0.05.
Even in the presence of evidence we may still make the wrong decision, either way, see Table 2 (below) for more
details.
10 Although it is not an “error” as such! It is just the natural result of taking a random sample.
11 Of course actual innocence may or may not be the true fact!!
12 Which again may or may not be the true fact!!
Many research questions are framed in the words of a (null) hypothesis, and an alternative. For example:
Hypothesis: The average amount spent by customers in the supermarket over the weekend was €100.
Alternative: The average amount spent was not €100¹³.
Hypothesis: Proportion of those surveyed agreeing or strongly agreeing with statement X was 0.5.
Alternative: Proportion agreeing or strongly agreeing with statement X was not 0.5¹⁴.
Hypothesis: New drug is no different in relieving pain than the standard drug.
Alternative: New drug is different (or better, or worse¹⁵).
Hypothesis: Marketing campaign has not raised awareness of product.
Alternative: Marketing campaign has raised awareness of product.
The default hypothesis is called the null hypothesis in statistics. The null hypothesis is typically a statement along the
lines of “nothing interesting/new here”. The alternative is called just that: The alternative hypothesis. These are
respectively referred to as H0 and H1 in statistics.
The null hypothesis tells us what sampling distribution we need to consider (ie distribution of means, or proportions?).
This corresponds to us considering what things are possible if H0 is indeed true. We can later compare this against
what we actually find, and see how likely our actual result is, if H0 is in fact true.
The alternative tells us the type of hypothesis testing we must do. See table above, and footnote 15. In most cases this
tells us whether we carry out a 1 or 2 sided test, although more complicated alternatives are possible.
It can take a little getting used to framing research questions in this type of format; however, this is really essential if
you want to use statistics to support or refute any such hypothesis. It is essential to be clear what the default
hypothesis is, and what the alternative might be. Typically the “default” hypothesis states some kind of status quo, and
it is often in proving the alternative that we are in fact interested16.
When viewing the following table remember that we do not (in general) know reality, we just know the evidence and
infer what we believe must be true from this. Behind what we infer and believe still lives the reality!
                        They are (in reality) innocent          They are (in reality) guilty
We reject innocence     We have found them falsely guilty!!     We’re right ... We found them guilty,
                                                                and they (in reality) are indeed guilty.
We accept innocence     We’re right ... We found them           We have found them falsely innocent.
                        innocent, and they (in reality) are
                        innocent.
The four outcomes from the above table are possible in any hypothesis test17 as follows:
13 Later questions obviously might be: Was it above or below the €100?
14 Again, later questions are obvious.
15 The statement of the alternative has significance in statistics, as it not only is the alternative to the initial hypothesis, but can also include other information. For example, if you somehow knew that the new drug was definitely not worse, then the initial/default hypothesis is as stated, but the alternative can include this extra info, and would become: The new drug is better than the standard drug.
16 For technical reasons it can be somewhat “simpler” if we can arrange to reject H0 if our experiment works out as “expected”, as we then won’t need to concern ourselves with Type II errors or Power issues.
17 Or any argument!!
It is important to be aware of the four possible outcomes, since this will help you both understand what you have from
your CI/hypothesis test and what issues (and resolutions) there might be.
The Type I (spoken as “type 1”) and Type II (spoken as “type 2”) errors go back to hypothesis testing where these are
the standard terms. If you examine the table you will see that:
18 A “p value” is the probability of getting the calculated sample test statistic (eg mean) assuming the null hypothesis is true. Think of it as evidence for H0. If p is “low” (ie p < α, the chosen significance level) then there isn’t evidence from your given sample supporting H0, and we might (perhaps) reject H0 in favour of H1. If p is “large” (ie p > α) then there is evidence that H0 is indeed true, and we don’t reject H0 at the given significance, for the given sample size.
The sampling distribution of means is N(μ, σ²/n), with the (unknown) population mean μ at its centre. There is a
probability of α/2 that the sample mean will be in either tail, giving an overall probability of α that the sample estimate
μ̂ is in a tail, and an overall confidence level of 1 − α.
The consequence of a Type I error is that we will incorrectly reject H0. As an example using proportions consider:
We calculate our 95% CI and it comes out as (say) (0.55, 0.75). This is entirely above 0.5 (the 50:50 point), so we would
conclude at the 95% level of confidence that we should reject H0. But you have to bear in mind that there is a 5%
chance that this is incorrect, and that it was in fact just a “freak” sample. These are “random” samples (or they should
be!!) so this is going to happen from time to time.
One approach to avoiding type I errors is to calculate your CI for a range of confidence levels, and see what the “best”
level of confidence you can achieve is. Here this would be the highest level of confidence at which the entire interval is
above 0.5. At 95% (above) the lower limit is only just over 0.5, so the 99% level might fail. This would highlight to us
that we may be in type I error territory. If so, it is just a matter of pointing this out in any analysis. If another sample
was possible then this could reassure us one way or the other.
Some level of type I error is typical in any analysis. It is just commented on as part of the analysis. Obviously the smaller
it is the better, which is why it is good to see what the highest level of confidence we can generate is.
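To sketch this idea outside SPSS, here is the normal-approximation CI for a proportion at several confidence levels; the sample values (p̂ = 0.56, n = 200) and the Python code are purely illustrative:

```python
# Normal-approximation confidence intervals for a sample proportion at
# several confidence levels (invented sample values).
import math

p_hat, n = 0.56, 200
se = math.sqrt(p_hat * (1 - p_hat) / n)

# Standard normal critical values for common confidence levels
z = {"90%": 1.645, "95%": 1.960, "99%": 2.576}

results = {}
for level, z_crit in z.items():
    lower = p_hat - z_crit * se
    upper = p_hat + z_crit * se
    # is the whole interval above the 50:50 point?
    results[level] = (round(lower, 3), round(upper, 3), lower > 0.5)

for level, (lo, hi, above) in results.items():
    print(f"{level}: ({lo}, {hi})  entirely above 0.5? {above}")
```

Here the interval stays above 0.5 at 90% but not at 95%, so 90% would be the “best” level of confidence achievable for this (invented) sample.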
While this is an important idea, we will not take it much further, as it is a more advanced topic.
The “best” thing to do is always set up your hypothesis so that you can reject H0, and then you can never have any type
II error at all!
If you fail to reject H0 then you must consider type II error, and test power.
Another way to avoid/lessen/remove type II error is to increase sample size and redo your sample, in the hope that a
larger sample will contain sufficiently precise and accurate19 information to reject H0.
9.3 Means
The mean is probably the single most commonly used (abused20?) statistic. It is used with scale data. It represents the
arithmetic centre of all of the data. It is affected by extreme values, and can become unrepresentative of a typical value
as a result.
This is VERY useful. A lot of real world data will not necessarily have a “nice” mathematically accessible distribution. If
we were relying on the distribution of the population to be “nice” then this could cause problems for doing statistics.
However the CLT tells us that we need not concern ourselves too much about the population distribution, if we intend
to base our statistics on means, as they will be normally distributed, once the sample size is “large enough”.
What is “large enough” is hard to say!! But many people take n>30 as sufficient to invoke the CLT. In fact it gets invoked
even for small-ish sample sizes!
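A small simulation (in Python rather than the simulation tool) shows the same effect; the choice of an exponential, ie very skewed, population is arbitrary:

```python
# CLT sketch: means of random samples from a skewed (exponential)
# population. The population mean of expovariate(1) is 1.0.
import random
import statistics

random.seed(42)

def sample_means(sample_size, n_samples=1000):
    return [
        statistics.mean(random.expovariate(1.0) for _ in range(sample_size))
        for _ in range(n_samples)
    ]

means_5 = sample_means(5)    # 1000 means of samples of size 5
means_25 = sample_means(25)  # 1000 means of samples of size 25

# Both distributions of means centre on the population mean of 1.0 ...
print(round(statistics.mean(means_5), 2), round(statistics.mean(means_25), 2))

# ... and the larger sample size gives a tighter distribution of means.
print(statistics.stdev(means_25) < statistics.stdev(means_5))
```

Plotting histograms of means_5 and means_25 would show both looking roughly normal, the n = 25 one more so, mirroring the screenshots below.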
19 The terms precision and accuracy are often confused. Something is precise if the range of answers is narrow, even if they are not accurate. Accuracy refers to the correctness of a value.
20 Be careful not to calculate a mean where it is not the appropriate measurement to produce a typical/representative value! For example with data that has extremes, or indeed for data that is coded as a number but which isn’t in reality scale data! A simple example is Male/Female data coded as 0/1. It is possible to calculate the mean, but it makes no sense!
After 5 means have been calculated from each set of samples (ie 5 samples of size 5, and 5 samples of size 25) we get:
And it is still not obvious that there will be a pattern to the sampling distributions ... However, if I advance things with
1000 such random samples (ie 1000 of size 5, and 1000 of size 25) we get:
Notice that the two distributions of sample means below are both beginning to display
some level of similarity with a normal distribution ... as we know they MUST because we
know of the CLT!!
There have now been 11,010 sample means calculated for each distribution below,
random samples taken from the population above. Also included below is a normal
distribution curve, and you can hopefully see just how close the distribution of means
actually is to the normal curve!! Even for samples of size 5!!
This simulation tool is a great way to get to grips with sampling distributions. It is possible to display sampling
distributions for different statistics, with different sample sizes, and different populations.
So because of the CLT we know that means (from samples of “sufficient size”) will be normally distributed. This is great
because a lot is known about the normal distribution!!
This is why the first three tests in the following sections are called “t tests”.
Example: We want to check that a machine in a factory actually delivers an average of 100g of sweets per bag.
We take a sample of 30 bags, and weigh the contents, recording the results in SPSS. We find that the mean is 98g.
However this difference could be just sampling variation/error. We conduct a One Sample T Test:
One-Sample Statistics
One-Sample Test
One-Sample Statistics
The “Std. Error Mean” is the “standard error of the mean”, and it is the standard deviation of
the means that you would expect to get, if you were to take many separate samples, and for
each one calculate a new mean. It is thus a measure of the sampling error/variation that we
expect to get when calculating means here. Every sample statistic also has a standard error
statistic associated with it, to describe the variation we would expect to find from sample to
sample for this sample statistic.
Next to the test results table, which has the key information here from the t test:
df stands for “degrees of freedom”. Here it is just n − 1. The shape of the t distribution
is affected by the sample size, since the bigger the sample the better our estimate of
the population standard deviation will be. This is taken into account by this “df” term.
Typically (across different statistics) the higher the df the better.
Sig. stands for (statistical) significance. This is the “p-value” (See Table 1) here, and
represents the “evidence” that supports the null hypothesis, ie that the population of bags
actually has a mean of 100g, given the sample information/evidence.
We would “reject the null hypothesis” here, as the p-value indicates there is only a 1 in
1000 chance (ie 0.001) of getting a sample like this if the population mean is in fact 100g.
The “2-tailed” refers to how we state our “alternative hypothesis” (See Table 1). If we
believe that the alternative might be mean > 100 (or for that matter mean < 100) then we
would conduct a 1 sided test. Since it could be either way here, this needs a 2-tailed test!
One-Sample Test
This t statistic is (98 − 100) / 0.551 ≈ −3.63.
The “95% confidence interval” gives us a range of values within which we are 95% confident the actual difference
lies. Here we are 95% confident that the population mean is below the expected 100g by between 0.87g and 3.13g.
While many people just use hypothesis tests, and quote p-values, confidence intervals are more useful as they
contain more information. From the hypothesis test we just know that there is a 1 in 1000 chance of a sample like
ours if the population mean is in fact 100g, whereas the confidence interval actually tells us a range of sizes for the
difference. Other common confidence intervals are 90% or 99%.
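The numbers in this output can be reproduced by hand; here is a sketch in Python rather than SPSS. The sample standard deviation of 3.018 is an assumption chosen so that the standard error matches the 0.551 in the output, and 2.045 is the usual 95% critical t value for df = 29:

```python
# One-sample t test for the sweets example: H0 says the population mean
# is 100g; our sample of n = 30 bags has mean 98g.
import math

n = 30
sample_mean = 98.0
hypothesised_mean = 100.0
sample_sd = 3.018          # assumed; gives SE of about 0.551 as in the output

se = sample_sd / math.sqrt(n)                  # standard error of the mean
t = (sample_mean - hypothesised_mean) / se     # t statistic

t_crit = 2.045                                 # 95%, df = n - 1 = 29
diff = sample_mean - hypothesised_mean
ci = (diff - t_crit * se, diff + t_crit * se)  # 95% CI for the difference

print(round(t, 2))                             # -3.63
print(round(ci[0], 2), round(ci[1], 2))        # -3.13 -0.87
```

The CI of roughly (−3.13, −0.87) is the “Lower”/“Upper” pair in the One-Sample Test table: the machine appears to under-fill by somewhere between about 0.9g and 3.1g.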
9.3.4 Two (Independent) Groups: Means that differ from each other
Once one has the concepts for a single mean, from 9.3.3 above, the ideas are relatively easily extended to compare two
means. As it happens there are two quite different research designs that are possible which could result in comparing
two means, and it is CRUCIAL to be clear which one you are actually dealing with!
This section deals with the design where there is a sample taken from two independent groups. This is sometimes called
a “between subjects” design, as we will be investigating differences between two separate/independent groups.
The alternative, dealt with in section 9.3.5 below, is where the same group are sampled on both occasions. They are
thus not independent groups! This is sometimes called a “within subjects” design, as we will be investigating differences
within a single/dependent group.
As an example, say we have conducted a survey of student opinions of a maths web site(!!), where these are scored on
a 1 to 5 Likert scale21, with 1 being “very poor” and 5 being “very good”. We carried out a survey before making some
changes to the web site, and a second survey of a new random group of students (say the following year22), after making
changes to the site. We want to see if student opinions have changed, ie mean2 = mean1?
Note that although this example relates to possible changes across time, it is still a between subjects design, since we
are not using the same subjects for the second survey.
Statistics: Student opinion of maths web site

Before changes:  N (Valid) 22, (Missing) 0;  Median 2.50;  Mode 4
After changes:   N (Valid) 27, (Missing) 0;  Median 4.00;  Mode 5
21 See for example http://en.wikipedia.org/wiki/Likert_scale for a discussion on whether it is valid to treat Likert data as scale data, which is required for the t test here!
22 Asserting that any change in the opinions is attributable to the web site changes could easily be challenged here! Someone could argue that it is simply a new set of students, one year on, who happen to feel more positive. Research design should ideally consider how change might be attributed, so that such questions can also be answered!
Have a look at the various summary data. It certainly (to me!) seems to suggest that things have improved in the opinion
of the sampled students ... but ... could this difference just be down to sampling error, or is it something “new”? To find
out we conduct an “Independent-Samples T Test”, making sure to clear any “split file” command.
Group Statistics
NOTE: The main test results (next) have been split here to make the whole table visible within the page. In
SPSS this is a single table:
We’ll come back to the Levene’s test below, but first the main point is the “Sig.” information, which shows that there is
likely to be a real difference in opinion between the two groups. From the confidence interval information the real
difference could be as low as 0.5 of a point on the Likert scale (of little practical significance), up to 2 points on the scale.
When reviewing this type of information be careful to know what was actually calculated: Here it was
mean(before) – mean(after), so the minus signs in the CI are in the web site’s favour!! Had it been
mean(after) – mean(before) then negative CI values would mean things got worse in the opinion of the second group!
There are two sets of statistics shown, one row for “equal variances assumed” and the other for “equal variances not
assumed”. This is to do with how the population standard deviation approximation, used in the t test, is calculated,
and also how the t test itself is calculated. Levene’s test has a significance level of 0.642 here, indicating no real
evidence against the variances being equal, so we can reasonably safely use the first row of results as our actual
results. The “F” under the Levene heading refers to the F distribution, which is just another of a number of well
known distributions.
9.3.5 Paired (Dependent) Differences: Means that differ within subjects across 2 times
It is worth reading sections 9.3.3 and 9.3.4 above before this section.
Look in Help/Case studies, and then Statistics Base/T Tests/Paired Samples T Test. It is good experience to get used to
the case studies help files. Expand the tree item, and the sub-tree item, and you should see:
Carrying out the steps given in the case study should ultimately result in output as follows:
Paired Samples Test

                                        Paired Differences
                             Mean   Std. Deviation  Std. Error Mean    Lower     Upper        t   df  Sig. (2-tailed)
Pair 1  Triglyceride -      14.063      46.875          11.719       -10.915    39.040    1.200   15       .249
        Final triglyceride
Pair 2  Weight -             8.063       2.886            .722         6.525     9.600   11.175   15       .000
        Final weight
At this stage, having gone through the last two sections, the interpretation of this output should be fairly straight-forward.
9.3.6 Three or more (Independent) Groups: Means that differ from each other with 3 or more groups:
ANOVA
In section 9.3.4 above a method for comparing two means was presented. However, it is equally possible to have 3 or
more means that need to be compared. While this can be achieved via multiple T Tests, it is not good practice. Every
statistical test involves a potential error (Type I or Type II), so conducting many tests of any sort increases the
likelihood that the overall results/conclusions are incorrect/flawed. This is called the family-wise
error rate … the overall error from conducting several/many separate tests.
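The family-wise error rate for k independent tests, each run at significance level α, is 1 − (1 − α)^k, ie the chance that at least one of the tests gives a false positive. A quick sketch in Python:

```python
# Family-wise error rate for k independent tests at significance alpha:
# the probability that at least one test rejects H0 purely by chance.
def familywise_error(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

for k in (1, 3, 10):
    print(k, round(familywise_error(k), 3))   # 1 0.05 / 3 0.143 / 10 0.401
```

So ten separate tests at the 5% level carry roughly a 40% chance of at least one spurious “significant” result, which is why a single ANOVA is preferred.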
It is possible to conduct a single statistical test involving many means, and this can be achieved via an ANOVA. The
acronym ANOVA stands for ANalysis Of Variance. The reason for this naming comes from the mathematical background
to ANOVAs.
There are many types of ANOVAs possible, and the details are beyond the scope of this document.
The relevant menu paths in SPSS are: Analyze/Compare Means/One-Way ANOVA, and Analyze/General Linear Model,
with all 4 sub-menu items being relevant in the latter path. There are further relationships between the general linear
model and regression, but this is again beyond the scope of this document.
9.3.7 Three or more (Dependent) Groups: Means that differ from each other with 1 group across 3 or
more times: ANOVA
Similarly to the last section, 9.3.6 above, coverage of this topic is beyond the scope of this document, but to give a
starting point: The relevant technique here is called a Repeated Measures ANOVA.
The relevant menu path in SPSS is: Analyze/General Linear Model/Repeated Measures.
9.4 Proportions
Wherever you might calculate a percentage out of some total number/count, then you might find it useful to carry out
statistical testing on proportions. All percentages are just proportions multiplied by 100%, so 33% is 0.33 as a
proportion.
A common use of statistics for proportions is the analysis of Likert scale data. Say you have a scale from 1 to 5, with 1 =
Very poor, 2 = Poor, 3 = Neutral, 4 = Good, 5 = Very good, rating a given service. From your data you find that 60% of
respondents rate the service as poor or very poor. While that is bad news, now you are wondering what might this
sample result tell you about the possible population proportion rating for your service!
To tackle the purely statistical hypothesis question of “The majority of the population rate my service poorly”, a
crosstab could be used (see “Crosstabs and Frequencies: Chi-Squared tests” section later). However a test that produces
a confidence interval (CI) is often “better”, since the CI conveys not only the information about the hypothesis, but also
a range of values that are likely for the statistic involved. The “better” here has to be quoted since there are always pros
and cons for any test, so the CI might not be “better” at all, if it lacked some other desired statistical property, such as
power. It depends. Let’s not get too distracted at this introductory level by too many subtleties23.
23 Just ignoring the possible presence of subtleties is NOT good. Knowing that they are there should at least make you wary of jumping to too many conclusions, or give you some healthy scepticism about conclusions drawn from results. Like many learning tasks, learning statistics is best viewed as an evolving process: no subtlety today; some awareness next week; even more in a year’s time!! And so on.
Tests for proportions are not directly available in SPSS, but there is an easy workaround so long as the sample is “large”,
and the test proportion isn’t near 0 or 1. Altman, et al., 2005, gives relatively straight-forward details for more accurate
testing of proportions.
The “trick” to get SPSS to carry out proportion testing is to create a new variable that has the following coding: 1 = Has
the characteristic we are interested in; 0 = Doesn’t have the characteristic we are interested in.
The reason that this method is approximate is that proportions calculated from counts are not continuous! Not every
value in between is actually possible, unless you can continue to get a larger and larger sample!
There are ways to calculate proportion statistics exactly, but they are beyond the scope of this document.
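The recoding trick can be sketched in miniature outside SPSS (Python here); the Likert answers are invented, with 4 and 5 counting as “will stay”, 1 and 2 as “won’t”, and the neutral 3s treated as missing:

```python
# Recode Likert answers into a 0/1 indicator: the mean of the indicator
# IS the sample proportion, so the usual one-sample machinery applies.
import math
import statistics

likert = [5, 4, 2, 4, 3, 5, 1, 4, 4, 2, 5, 3, 4, 2, 4, 5, 1, 4, 4, 5]

indicator = [1 if x > 3 else 0 for x in likert if x != 3]  # drop neutrals

n = len(indicator)
p_hat = statistics.mean(indicator)        # sample proportion
se = math.sqrt(p_hat * (1 - p_hat) / n)   # normal-approximation SE

# 95% CI for the population proportion (z = 1.96)
ci = (p_hat - 1.96 * se, p_hat + 1.96 * se)
print(n, round(p_hat, 2), round(ci[0], 2), round(ci[1], 2))
```

The same 0/1 variable fed into SPSS’s One-Sample T Test (with test value 0.5) is exactly the workaround described above.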
willStay
Position: 44
Format: F4
Measurement: Ordinal
Role: Input
Value 1, “No way”: 159 cases (15.9%)
What would be interesting (for example) here would be to find out if the population is likely to have a preference for
staying or not staying with the service provider. This corresponds to:
Null hypothesis: Proportion saying they will stay = 0.5 (ie there is nothing “interesting” here, and people have no
particular opinion).
We create a new variable, willStayY, which we compute from the existing willStay variable as follows:
Transform/Compute variable:
3. Use the “If” button to limit the cases that will result in
willStayY = 0 (from above). See next screenshot below.
Repeat the Transform/Compute variable steps again, but this time for willStay>3, and recode it to 1:
The neutral value here of willStay=3 can either be left as “system missing” (ie dots in these cells in the data view), or
else you can use the Transform/Compute variable one last time to recode the willStay=3 into willStayY=9, with 9
representing “missing value” here.
Go to the variable view, and change the willStayY variable to have 0 decimal places, and if you used 9 as missing value,
then enter this for the missing field.
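The Transform/Compute recoding above can also be sketched outside SPSS, in Python with numpy. The responses here are hypothetical 5-point ratings; NaN plays the role of "system missing" for the neutral value:

```python
# Recode a 5-point willStay rating into a 0/1 willStayY variable:
# 1, 2 -> 0 (won't stay); 4, 5 -> 1 (will stay); 3 -> missing.
import numpy as np

will_stay = np.array([1, 2, 3, 4, 5, 5, 2, 4])  # hypothetical responses

will_stay_y = np.full(will_stay.shape, np.nan)  # start everything as missing
will_stay_y[will_stay < 3] = 0                  # 1, 2 -> 0
will_stay_y[will_stay > 3] = 1                  # 4, 5 -> 1
# will_stay == 3 (neutral) stays NaN, ie "system missing"

print(will_stay_y)
```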
The output results are the One-Sample Statistics and One-Sample Test tables, the latter including the lower and upper bounds of the 95% confidence interval.
We can now see that the proportion does differ statistically significantly from 0.5, so respondents definitely had some
preference. The sample mean is 0.56, but the 95% CI tells us that the population proportion that are likely to say they
would stay is 0.5 + 0.03 up to 0.5 + 0.10, which is 0.53 up to 0.60.
Looking purely at the significance we’d just know that there was likely to be a preference, but having the CI tells us the
range of the preference, so is more useful and informative.
9.4.2 Two (Independent) Groups: Proportions that differ from each other
Same idea as in section 9.4.1 above: Recode the original variable into a new 0, 1 variable. Then use the
Analyze/Compare Means/Independent Samples T Test on the new variable.
9.4.3 Paired (Dependent) Differences: Proportions that differ within subjects across 2 times
Same idea as in section 9.4.1 above, except first recode the two original variables into two new 0, 1 variables. Then use
the Analyze/Compare Means/Paired Samples T Test on the two new variables.
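The two-group version of the idea can be sketched the same way in Python (again not SPSS, and with made-up group data): recode each group to 0/1, then run an independent-samples t-test on the codes.

```python
# Independent-samples comparison of two proportions via a t-test on
# 0/1 recoded variables. Hypothetical groups of 100 each.
import numpy as np
from scipy import stats

group_a = np.array([1] * 60 + [0] * 40)   # 60% have the characteristic
group_b = np.array([1] * 45 + [0] * 55)   # 45% have the characteristic

res = stats.ttest_ind(group_a, group_b)
print(group_a.mean(), group_b.mean(), res.pvalue)
```

For the paired case the only change is to run `stats.ttest_rel` on the two recoded variables instead.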
An interesting and useful question at times is: Does the pattern of counts that you get in a crosstab have any significance?
One common statistical approach to this question is a "chi-squared" test, which is sometimes written as a χ² test, χ being
the Greek letter "chi".
A chi-squared test is an approximate test (beyond this doc to explain why!), but is usually "good enough" provided none of
the cells in the crosstab has an expected count of less than 5.
A chi-squared test can be used in a number of different ways, which are named differently, but in essence they are all
tests of frequencies against some sort of expected values, under some sort of assumption. The two most common are the
chi-squared test of independence and the chi-squared test of goodness of fit.
Lots of people who never smoked get lung cancer, so someone who wants to defend smoking might argue that lung
cancer is NOT related to smoking, it is just something that people can get, and anyone can get it. To test this idea we
could gather some data on a cohort of people, some who smoke, and some who don’t. We test everyone for lung
cancer, and produce (for example) the following table:
A chi-squared test compares the table observed counts against counts that would be the expected counts if smoking
and lung cancer are NOT related … These are easy to calculate by hand, as follows:
If we ignore smoking then there are 100 people with a lung cancer diagnosis out of the 1000. So under
the assumption that smoking makes no difference we would expect 100/1000 of any group to
actually get lung cancer. We apply this fraction to the 475, and the 525, and this gives us the number
of lung cancer diagnoses we would expect to get if smoking makes no difference:
475 × 100/1000 = 47.5, and 525 × 100/1000 = 52.5
Similarly there are 900 out of 1000 that don't get lung cancer, so we can apply this fraction to the 475 and
525 to get the number we would expect not to have lung cancer, if smoking makes no difference:
475 × 900/1000 = 427.5, and 525 × 900/1000 = 472.5
24
I’ve made this example up!! But if you Google “smoking and lung cancer chi squared” you should be able to find actual
data/examples.
The test statistic is then χ² = Σ (Oi − Ei)² / Ei, where the Oi are the
observed counts, and the Ei are the expected counts, under the assumption of smoking making no
difference. This χ² value can then be compared against a "standard" chi-squared distribution to see if
it is what we would expect if the two tables of frequencies agreed.
We’ll come back to this example, and complete the details in Excel in section 9.5.2 below. You
probably won’t be surprised to find that the test will show that the assumption is NOT correct, so lung
cancer diagnosis here is NOT independent of smoking.
This idea can be applied to many situations. Taking the telco.sav sample file, we can check if churn is independent of
education level, gender, marital status, etc.
As shown above, enter level of education, gender and marital status for the rows, and churn for the columns (it actually
makes no difference which are which for the Chi-squared test). You can use the “Cells” button to output the expected
counts (and more besides) if you want. Click on “Statistics”, and (as shown below) select the “Chi-square” option:
Click continue and then OK.
I've just included the level of education * churn crosstab and chi-squared results below; others are omitted for brevity.
In the crosstab (churn No / Yes), for example, the Post-undergraduate degree row shows 38 / 28 (66 in total), and the
Total row shows 726 / 274 (1000 in total). The Chi-Square Tests table follows.
The significance here, shown having been calculated using various techniques, is less than 1 in 1000, ie < 0.001, so the
test is telling us that there is a significant difference in churn across different levels of education. There are further
options to investigate which cells in the crosstab are actually contributing most to the significance. Under the "Cells"
button you can tick the "Unstandardized" residuals option (see below), which will show you the individual Oi − Ei
results. The bigger these are (in absolute size, ignoring the sign for now), the more that cell is contributing to the
statistically significant result. It is better (IMO!) to actually look at the individual (Oi − Ei)² / Ei results to get a more
accurate view of what is producing the stats result, but that option is not available directly. It can easily be done by hand
using SPSS's observed and expected values:
Repeating the above crosstab now gives output including the following:
In the crosstab with residuals, for example, the Post-undergraduate degree row shows counts of 38 (No) and 28 (Yes),
66 in total, with residuals of −1.7 and 1.7. Calculating the individual (Oi − Ei)² / Ei values by hand gives:

                                Churn
                                No      Yes
Did not complete high school    3.86    10.22
High school degree              1.17    3.10
Some college                    0.02    0.05
College degree                  4.58    12.14
Post-undergraduate degree       2.05    5.41
Now look for what is "large" (ie contributes most to producing the significant result). Clearly "Did not complete high
school" and "College degree" seem to be the main contributors, as shown shaded in the table. These two alone produce
a χ² test p-value of 0.0002, which is already significant! If I were working in customer services for this company I would
try to target these two groups specifically to lower the churn rate within them. This also gives us a very clear
example of where statistics can be part of the analysis and solution to a very real business problem!
I would imagine other analysis might produce other groups or sub-groups that could be “targeted”.
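The by-hand contribution calculation generalises to any crosstab, and can be sketched in Python with scipy (the table below is hypothetical, not the telco data):

```python
# Per-cell (O - E)^2 / E contributions for a crosstab: the biggest
# contributions flag the cells driving a significant chi-squared result.
import numpy as np
from scipy import stats

observed = np.array([[30, 20],
                     [10, 40]])   # hypothetical 2x2 crosstab

chi2, p, dof, expected = stats.chi2_contingency(observed, correction=False)

contrib = (observed - expected) ** 2 / expected   # per-cell contributions
print(contrib)
print(chi2, p)
```

The contributions sum to the overall χ² statistic, so inspecting them is exactly the "which cells are producing the result" exercise done above for education and churn.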
You can use the “Layers” crosstab option to make SPSS produce crosstabs and chi-squared results within subgroups, so
for example including Gender as Layer 1:
With the Statistics option set to Chi-square and the cells left at their defaults, the output includes, amongst other things,
a Chi-Square Tests table where the chi-squared results are now broken down by Gender, in addition to the overall results.
As an example, there is a slight difference between the numbers of males and females in the sample with 483 males and
517 females. Is this evidence that there are in fact more women than men in the population?
Under the assumption that there is no difference in the numbers of men and women, the expected number of each would
be 500, ie half the total sample size. This gives the expected values. Transferring all this to Excel and doing some
formatting produces:
          Observed   Expected   Individual Chi-squared Contribution
Male      483        500.00     0.58
Female    517        500.00     0.58
Total     1000
p-value = 0.282296653
The p-value is produced using the CHISQ.TEST command (Excel 2007/2010). This takes the observed and expected
values as parameters and produces the p-value result of the test: The probability of getting the observed counts if the
expected counts are in fact the underlying correct counts, given the population proportions, which in this case are ½ per
category.
The p-value of 0.28 (2dp) here is telling us that there is insufficient evidence to reject the null hypothesis that there are
equal proportions of males and females. This is written in the cautious language of statistics: A less accurate but more
relatable version of this would be: The evidence supports the assumption of equal proportions.
And this also shows how easy it is to do all kinds of count comparisons within Excel, using the CHISQ.TEST function.
Anywhere you have an observed set of counts, and have a theory on what the expected counts should be, from the
population, you can use a chi-squared test to check the assumption, given the sample/observed data.
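The same goodness-of-fit calculation as Excel's CHISQ.TEST can be done in Python with scipy, using the gender counts from above:

```python
# Goodness-of-fit chi-squared test: observed gender counts against
# equal expected counts of 500 each.
from scipy import stats

observed = [483, 517]
expected = [500, 500]

res = stats.chisquare(observed, f_exp=expected)
print(res.statistic, res.pvalue)   # p is approximately 0.282, as in Excel
```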
This kind of test could easily be used to check “interesting” patterns in Likert scale data, for example:
- There are as many “Very dissatisfied” as there are “Very satisfied”
- There are as many “Very dissatisfied” or “Dissatisfied” as there are “Satisfied” or “Very Satisfied”.
Another approach to the Gender question above would be to do a proportions test against a specific value, ie a one
sample T Test for a proportion, with a test value of 0.5. This approach has the advantage of producing a confidence
interval for the proportion, which isn’t available from the chi-squared test. You will often find in statistics that there are
a number of approaches possible. Each will have its advantages and disadvantages, depending on the context.
If possible, a test that produces a confidence interval should generally be preferred over a test that doesn't. For
count data split into two categories a proportions test should be possible, so it would be preferred over a chi-squared
test. The CI gives a range of values for the population proportion, which is extra information not produced by a chi-
squared test. However, if there are multiple categories this would require multiple proportions T Tests, which is to be
avoided where possible, due to family-wise error.
It is a bit beyond the scope of this document to detail it, but you can use Excel to calculate the numbers of data points
that you would expect to find within each bin, and then compare them in the “usual way” (from above, in last section)
to carry out the test.
It isn’t the most powerful test to use, but it is easy to implement, and relatively easy to understand, and interpret
afterwards.
Move Age in years into the Dependent list. Click on Plots, and select the “Normality plots with test” option. Deselect
Stem-and-leaf, and select Histogram:
Click Continue, OK.
Tests of Normality: Kolmogorov-Smirnov and Shapiro-Wilk
There are two tests of normality here: the Kolmogorov-Smirnov test and the Shapiro-Wilk test. Try Googling
to see what the differences are! The significance for both is 0.000, ie p < 0.001: if the Age data really were
normally distributed, then data like these would be extremely unlikely. That would tend
to be fairly conclusive evidence!!
Sometimes these tests are more strict than is needed. Many statistical tests that technically require normality
are in fact not that sensitive to small to moderate deviations from normality! Rather than “blindly” trust
statistical tests alone here, it is generally a good idea to simply view the data as a histogram, and SPSS obliges
with the histogram output:
While the two statistical tests tell us that the data are unlikely to be normally distributed, the chart shows us
why that might be so. The data above look right skewed to me, for example.
A more statistical approach than simply “viewing the data” is the following Q-Q plot:
A Q-Q plot plots the observed quantiles of the data against the quantiles that would be expected if the data were in fact
normally distributed. If the data are normally distributed then the points all follow
the straight line! Where the points differ from the straight line, it is an indication that those points in particular
are "causing" the non-normality. Sometimes in analysis one then investigates these specific points to ensure
they are correct. Additionally, sometimes one proceeds with one's analysis with two sets of data: one that has
the outlier (or non-normal) points removed, and another that has the full data set. One then compares the
results to see whether the non-normal points are having a particular influence on the results. As mentioned
above, many statistical tests that require normality of the data are actually quite robust to departures from
normality, so it is worth conducting the analysis (if feasible) in this split way and then comparing results.
The chart below is basically showing the same info as the one above, except the line has been subtracted from
the data: The data are “detrended”:
The wave shape here is typical of skewed data.
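The same normality checks can be sketched in Python with scipy, on hypothetical right-skewed "ages" (a shifted gamma distribution, standing in for the real Age variable). `probplot` computes the pairs that a Q-Q plot draws, plus the fitted line:

```python
# Normality checks on hypothetical right-skewed data, mirroring the
# SPSS Explore output: Shapiro-Wilk, Kolmogorov-Smirnov, and Q-Q values.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
age = rng.gamma(shape=2.0, scale=15.0, size=1000) + 18  # right-skewed

sw_stat, sw_p = stats.shapiro(age)                       # Shapiro-Wilk
ks = stats.kstest(age, "norm",
                  args=(age.mean(), age.std(ddof=1)))    # Kolmogorov-Smirnov

# (theoretical, ordered-data) pairs for a Q-Q plot, plus the fitted line's
# slope, intercept and correlation r
(osm, osr), (slope, intercept, r) = stats.probplot(age, dist="norm")
print(sw_p, ks.pvalue, r)
```

Both tests reject normality here, and r below 1 on the Q-Q line reflects the skew, just as the detrended plot's wave shape does.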
Regression is one of the general techniques used to determine the nature of the relationship in such circumstances. It is
a very powerful technique. So-called "linear regression" is the most common form of regression, where the nature of
the relationship is assumed to be "linear" (no powers or cross products in it). Non-linear regression can also be
accommodated by introducing dummy variables. More on this later.
The simplest form of linear regression attempts to match an equation y = bx + a to a set of sample (x, y) data, where
the slope b and intercept a are determined to optimise the fit of the line to the data. The most common fitting
technique is called "Least Squares" (LS). This technique finds b and a so that the resulting line minimises the squared
distances of the points from the line: hence the name "least squares".
Correlation is a technique that measures relationships within data, or in the context of LS it measures how good a fit line
is to the data. It is quite possible to fit a LS line to any set of data, so you then look at correlation (and a decent
diagram!) to check to see whether using the LS line was actually a good idea: Does the line fit the data well?
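As a sketch (with made-up data), here is an LS line fit together with the correlation coefficient used to judge the fit, in Python with numpy:

```python
# Least-squares fit y = b*x + a, plus the Pearson correlation coefficient
# as a measure of how well the line fits. Hypothetical data, roughly y = 2x.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

b, a = np.polyfit(x, y, deg=1)    # slope b, intercept a
r = np.corrcoef(x, y)[0, 1]       # Pearson correlation

print(b, a, r)
```

Here r is very close to 1, confirming (along with a scatter diagram!) that the LS line really is a good description of these data.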
The y = bx + a model is the simplest form of LS. Where we are interested in more complex relationships we could have:
y = bk·xk + bk−1·xk−1 + … + b1·x1 + a
Here we are interested in how the factors x1 to xk relate to the y variable. For example we might wonder if end of
year results (y) could be predicted by looking at CAO points (x1), age (x2), results from the first CA (x3), and so on. Note
that here our data would consist of a y value and an x1, x2, x3 set of values for each case in the sample. This
potentially introduces an additional subscript, eg yi and x1i, x2i, x3i, for each case i. Sometimes there might be more
than one measurement for each of x1i, x2i, x3i, and this introduces another subscript: x1ij, x2ij, x3ij, which refers to the j-th
measurement for the i-th case.
There are various important assumptions that underpin aspects of LS regression, such as:
- The x's are assumed to have no error, so they are taken as exact measurements.
- For any given x (or any given x1ij, x2ij, ..., xkij in the multivariate case), it is assumed that the yi are normally
distributed about the LS line (see footnote 25).
- The variance of the y about the LS line is assumed to be constant across the range of x values (see footnote 25).
- The y's are assumed to be independent and identically distributed (see footnote 25).
Where the yi are not scale variables they cannot be normally distributed about the LS line, and alternative forms
of LS regression need to be employed, such as Logistic Regression for categorical data.
Where we want to include non-linear terms this can be done via so-called "dummy variables". For example, say we
want to model y = a + bx + cx²: we would introduce the "new" variables x1 = x and x2 = x², and now the model
becomes linear in x1 and x2: y = a + b·x1 + c·x2.
Once the coefficients from the model are determined they can be subjected to statistical testing, and confidence
intervals for them can be generated.
Having generated a model it is important to examine the residuals too, as these can be used to tell if model assumptions
have been met. In fact as noted in footnote 25 it is the residuals that should be used to check model assumptions have
been met.
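The dummy-variable trick for the quadratic model above can be sketched in Python with numpy (hypothetical, noise-free data so the fitted coefficients are exact):

```python
# Fit y = a + b*x + c*x^2 by introducing x2 = x^2 and running an ordinary
# (linear) least-squares fit on the columns 1, x1 = x, and x2 = x^2.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x ** 2        # exact quadratic, no noise

X = np.column_stack([np.ones_like(x), x, x ** 2])  # columns: 1, x1, x2
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b, c = coef
print(a, b, c)
```

The fit recovers a = 1, b = 2, c = 0.5: the model is non-linear in x but perfectly linear in the derived variables x1 and x2.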
25
Technically it is the residuals that should be independent, identically normally distributed, with constant variance, but as an
overview introduction this will suffice.
A correlation coefficient ranges from −1 through to +1, where 0 means no (linear) relationship at all. A value of −1 means
that as x increases, y decreases: this is called negative correlation. A value of +1 means that as x increases, y also
increases: this is called positive correlation.
Correlation coefficients are used in a number of more advanced statistical techniques such as Principal Component
Analysis (PCA) which attempts to find variables that are correlated with each other, within a set of data, so that we
might eliminate some of the variables, and thus work with a smaller set of predictor variables.
Values near to −1 or +1 are referred to as strong correlation, and values nearer to 0 are referred to as weak (or no!)
correlation, depending on the closeness to 0.
It is possible to do a statistical test on a correlation coefficient, to test whether the value from a sample is likely to be
representative of the population. Additionally it is possible to generate a confidence interval for a correlation coefficient.
"Multivariate" refers to having more than one outcome variable, so-called Dependent Variables (DVs). "Multiple regression",
as in the example from the last section, refers to having just one DV but possibly multiple Independent Variables (IVs).
Multivariate statistics in general is a fairly complex advanced topic. All I’ll say here is that (a) it is possible, and (b) the
idea is to simultaneously use as much data/evidence as one can. So where there are multiple DVs it might well be
possible to analyse each DV separately using multiple regression, however this is likely to be less statistically capable of
detecting significant effects than a full multivariate analysis would be.
The correlation coefficient is relevant to assessing how two variables relate to each other. For more than two variables
the coefficient of determination, R², is used. You can think of R² as simply the square of the correlation coefficient,
which is what it is for just two variables.
The coefficient of determination measures how much of the variation in the dependent variable (DV) is actually
explained by the model. So R² = 0.6 would mean that 60% of the variation in the DV is explained by the current
independent variable (IV) model. Clearly the higher this is the better.
However, if one adds enough IVs into a multiple regression model then you will eventually get a model that matches the
data fully/100%, or at least as fully as is possible. In this case you would get R² = 1 (or close to it), but this
would be at the expense of an over-complicated model, with many, many IVs!! The statistic R²adjusted takes into account the
number of IVs in the model, and so is a more balanced measurement of how good a model is at matching the data. It
still has the same basic meaning of measuring how much of the DV variation is explained by the model, but the answer
is then "adjusted" to penalise adding in too many variables that don't really add much to the predictive power of the
model.
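Both statistics are easy to compute by hand, which this Python sketch shows on a hypothetical single-IV fit (here k = 1, the number of IVs):

```python
# Compute R^2 and adjusted R^2 from the residuals of a least-squares fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1, 5.9])   # hypothetical data

b, a = np.polyfit(x, y, deg=1)
resid = y - (b * x + a)

ss_res = (resid ** 2).sum()                    # residual sum of squares
ss_tot = ((y - y.mean()) ** 2).sum()           # total sum of squares
n, k = len(y), 1                               # k = number of IVs

r2 = 1 - ss_res / ss_tot
r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
print(r2, r2_adj)
```

The adjusted value is always at or below the raw R², and the gap widens as more IVs are added without a matching gain in explained variation: exactly the penalty described above.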
10 Reading and Writing Statistical Results
When writing up your own results the best advice I can give here is to follow the format that is normally used in the
publication or company you are writing for! Even before you design your own analysis it is advisable to make yourself
familiar with the target publication, as they might have preferred approaches/methods.
When reading statistics watch out for the p-value: the probability of getting a sample result/statistic at least as extreme
as the one observed, if the null hypothesis is true. Pay attention to sample size, since small sample results should generally be given less weight than
results derived from large samples. Notwithstanding this, also pay attention to bias: Ideally samples should be randomly
selected. Deviations from randomness mean that results might be less representative of the population, and so might
deserve less weight/significance. A large biased sample is probably worse than a small unbiased/random sample, since it
could lead to apparently weighty results that in fact could be biased! Typically sample size is written using the letter n,
for example: “A statistically significant difference (t = 5.3, p < 0.001, n1=20, n2=30) was found between the two groups.”
For this format it is typical to quote the actual statistic (t =5.3 here), along with the p-value and sample size. Other
information might also be included, if relevant/useful, such as the sample means and the SE of means for each group.
An alternative to “raw” p-values is to quote a confidence interval for a statistic. This tends to convey more information
than the p-value alone, as it gives the statistical significance, and also a range of values for the likely size of the effect
within the population. For example: “A statistically significant difference (95% CI [-5.2, -1.3], n1=20, n2=30) was found
between the two groups.”
11 Further Study
Watching MythBusters or other TV programmes on the Discovery channel, or elsewhere, you will see applications of
statistics to novel/interesting problems!!
This document attempts to provide a non-mathematical introduction to SPSS and some of the statistics therein. To
further your understanding I would recommend considering taking some of the Royal Statistical Society's professional
examinations: www.rss.org.uk. They conduct examinations at the following levels (quoted details copied from the RSS
website, Dec 2012):
- Ordinary Certificate: “The Ordinary Certificate is the entry level of the Society's professional examinations. Its
aim is to provide a sound grounding in the principles and practice of statistics, with emphasis on practical data
collection, presentation and interpretation. In terms of level, it is pitched between GCSE and A-level standard in
the English school system, but the nature of the syllabus is very different because of the emphasis on practical
statistical work. It is intended both as a first qualification, an end in itself; and as a basis for further work in
probability and statistics, as for example in the Society's Higher Certificate and Graduate Diploma examinations.
Holders of the Ordinary Certificate should be able to carry out supervised statistical work of a routine kind, or be
able to apply statistical methods, at an elementary level, within work of a more general nature.”
- Higher Certificate: “The Higher Certificate is the intermediate level of the Society's professional examinations. It
is intended both as an end in itself in respect of being a qualification in statistics more advanced than that of our
Ordinary Certificate, and as a basis for further work in statistics up to the highest undergraduate level, as for
example in our Graduate Diploma. It contains some work at the equivalent of A-level in the English school
system, but most of its material is similar to what would be found in the first year of a typical university course
in statistics. Indeed, some of its topics might be in the second year of a university course. It gives a thorough
introduction to statistical theory and inference at this level, stressing the importance of practical applications.”
- Graduate Diploma: “The Graduate Diploma is the highest level of the Royal Statistical Society's professional
examinations. It is of a standard equivalent to that of a good UK Honours Degree in Statistics, giving a thorough …”
These are self study qualifications. Local examinations are organised once per year. Lots of details including syllabus
details, past exam papers and solutions are available on the RSS web site.
Students having completed first year Business Mathematics and Statistics at ITB to a good standard would have a good
foundation for tackling the ordinary certificate, although further study would be required to build sufficiently on the
foundation to pass overall.
12 References
Altman, D. G., Machin, D., Bryant, T. N. & Gardner, M. J., 2005. Statistics with Confidence. 2nd ed. Bristol: Arrowsmith.
Chatfield, C., 1983. Statistics for Technology: A Course in Applied Statistics. 3rd ed. (revised). Boca Raton:
Chapman & Hall/CRC.
Dytham, C., 2011. Choosing and Using Statistics: A Biologist's Guide. 3rd ed. Oxford: Wiley-Blackwell.
Elliott, A. C. & Woodward, W. A., 2007. Statistical Analysis Quick Reference Guidebook: With SPSS Examples. Thousand
Oaks: SAGE Publications.
Field, A., 2009. Discovering Statistics Using SPSS. 3rd ed. London: SAGE Publications.
Pallant, J., 2010. SPSS Survival Manual: A Step by Step Guide to Data Analysis Using SPSS. 4th ed. Maidenhead: Open
University Press.
Salkind, N. J., 2004. Statistics for People Who (Think They) Hate Statistics. 2nd ed. London: SAGE Publications.