Chapter 5 Descriptive Statistics in SPSS
© Sanjay Singh
"The road to every heaven goes through a hell. Bear this in mind."
- Swami Vivekanand, Complete Works
Sanjay Singh
© Sanjay Singh.
Email: sanjay.singh3210@gmail.com
Images and screenshots of IBM SPSS Statistics software are used in this book for
learning purposes only. The source/credit of any other illustration, image or
resource is duly acknowledged wherever applicable. This book is a "work in
progress" and I will keep it continuously updated. Please do not feel offended if
you find typos or errors anywhere, and kindly do not rush to assess the
IQ & EQ of the author on the basis of some unintentional mistakes that he, as a mortal,
may commit. I am yet to organize the content of many chapters, and proofreading
the book is a distant dream. This book is like an experimental release, and it will
take time to give it a final shape. Kindly behave and do not redistribute the content
in an unauthorized manner. Any positive comments and suggestions to improve the book
are welcome.
This chapter focuses on understanding various kinds of descriptive statistics using SPSS. SPSS
is an efficient tool for descriptive analysis as well as for inferential analysis in
your research work. There are many ways to calculate descriptive statistics in SPSS.
First, open a data set in SPSS to demonstrate how to use descriptive analysis. Open the
‘diabetes_costs.sav’ file from the SPSS installation folder by following these steps: click on
the Folder icon, then go to C Drive > Program Files (check for the Program Files folder if you
are using a 64-bit operating system; on a 32-bit operating system, click the Program Files x86
folder) > IBM > SPSS > Statistics > 23 > Samples > English > 'diabetes_costs.sav' > Open.
We will use this data set to demonstrate how to calculate various descriptive statistics
(Figure 1). The data set contains the age of the subject, glucose level, income and treatment
cost. Presumably, it studies the treatment cost of a subject based on the subject's age,
glucose level and income. Descriptive statistics are useful in communicating an overall picture
of your data set. For example, you might wish to report the average age of subjects in this
data set or the percentage of subjects below a particular age. If you want to communicate what
percentage of subjects in the data set is below a certain age (in years), you will use
percentiles. If you need to communicate the average age of the subjects, you will report the
mean. Then there are descriptive statistics like the median, mode, etc., and we will learn to
understand and calculate them in a stepwise manner. The variables of study here are age,
glucose level, income and treatment cost, so we might be interested in calculating the average
income of the subjects in our data set and the average treatment cost.
years. So, for calculating that, you may wish to calculate the frequencies.
To do that, click on Frequencies and you will see a dialog box (Figure 2). Further down in the
dialog box there is a checkbox, checked by default, that reads ‘Display frequency tables’. It
simply means that if you keep this checked, you will see a frequency table in the output.
Distribution refers to the shape that the scores take when they are plotted. It gives an
informative look at the distribution plot of the study variables. There are two measures of
distribution listed here: skewness and kurtosis. Next, one commonly used type of descriptive
statistic is the percentile value. A percentile is a type of score that tells you what
percentage of people in a given set of observations lie below a particular score. For example,
if 90% of the workers in an office earn less than USD 10,000, then the 90th percentile is
USD 10,000. This conveys the location of an employee in a series. Many competitive examinations
in India, like the Common Admission Test (CAT) for MBA admission, report percentile scores.
Here, percentiles tell you what percentage of candidates score below a particular candidate or
a particular score. Quartiles are nothing but a type of percentile. If you divide the
percentiles (which vary from 0 to 100) into four equal categories, you will get four quarters;
the cut points between them (the 25th, 50th and 75th percentiles) are known as quartiles. So,
these are the types of descriptive stats available in SPSS and we will learn how to use and
calculate each of them in further chapters.
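The idea of a percentile as "the percentage of observations below a score" can be sketched in a few lines of Python. This is only an illustration outside SPSS: the function name `percent_below` and the salary figures are hypothetical, and SPSS itself may use a different interpolation rule when it reports percentile values.

```python
from bisect import bisect_left


def percent_below(scores, value):
    """Return the percentage of scores that lie strictly below `value`."""
    ordered = sorted(scores)
    # bisect_left returns how many sorted items are smaller than `value`
    return 100.0 * bisect_left(ordered, value) / len(ordered)


# Hypothetical salaries (USD) of ten office workers, invented for illustration
salaries = [4000, 5200, 6100, 6500, 7000, 7800, 8300, 9000, 9900, 15000]

# Nine of the ten salaries are below USD 10,000
print(percent_below(salaries, 10000))
```

With these made-up salaries, nine out of ten workers earn less than USD 10,000, so USD 10,000 sits at the 90th percentile, as in the office example above.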
A number of descriptive statistics options are repeated in various locations in the software.
Each variation of a given option serves a slightly different purpose.
Let us open a different data set here. Open the SPSS installation folder by following this
path: Program Files > IBM > SPSS > Statistics > 23 > Samples > English > bankloan.sav > Open.
In this data set, the variables are age, education level, etc. Suppose we want to assess which
group has a higher average income than the others. The education level has been assigned
values 1 to 5, with 1 signifying ‘Did not complete high school’ and 5 signifying
‘Post-undergraduate degree’ (Figure 3). The idea is that education brings a better salary, so
you may want to find out the average salary of postgraduates as compared to the average salary
of the rest of the groups.
Figure 7: Values for educational level variable
I hope you have realized the utility of having different descriptive statistics options and
tabs in SPSS. They have not been repeated at random. They are useful for different purposes,
and if you keep practicing on SPSS, you will gradually realize the subtle differences between
them. However, certain options are repeated with the same purpose.
Figure 11: Descriptives: Options dialog box
Some options are repeated but their purpose is distinct. Gradually, we will learn their purpose
and how to use them. The Descriptives dialog box has the option ‘Save standardized values as
variables’, which calculates z-scores. Of the two dialog boxes, only Descriptives offers this;
z-scores cannot be calculated from the Frequencies dialog box discussed earlier. We will learn
these options gradually, and over a period of time you will get used to them.
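To make the idea of a standardized value concrete, here is a minimal Python sketch of what a z-score is: each value minus the mean, divided by the standard deviation. It is an illustration, not SPSS output. The ages are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) is an assumption about how the saved values are computed.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize values: subtract the mean, divide by the sample SD."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1); an assumption
    return [(v - m) / s for v in values]


# Hypothetical ages, invented for illustration
ages = [25, 46, 27, 51, 42, 51, 38]
z = z_scores(ages)

for age, score in zip(ages, z):
    print(age, round(score, 2))
```

By construction, the standardized values have a mean of 0 and a standard deviation of 1, which is what makes z-scores comparable across variables measured in different units.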
different ages repeated in this data set. The research question is: Is there more than one
person at any of the ages given in the data set (i.e., 25, 46, 27 and so on)? Keep the ‘Display
frequency tables’ option checked. Click OK and the output tables are generated.
Now, in the frequencies output, a total of 250 valid cases have been processed, which is the
same as the sample size (Figure 4). So, all the cases have been processed and there is no
missing data. The frequencies of the various ages are given in the output table. There is one
person aged 13 years, one aged 14 years, one aged 15 years and so on. So, our question is: How
many persons are 25 years old? There are two persons aged 25 years, and certain other ages are
also repeated in this data. The most common age in this data set is 51 years. There are 14
individuals aged 51 years, and after age 51 come 11 individuals aged 42 years. So, frequencies
give you a rough idea of which observations are repeated in your data set and how many times.
Needless to say, the most frequent values seen here will come up again later as the mode of
this data set. The Percent column in Table 4 gives each frequency as a percentage of the total.
So, the first age (13) is .4% of the total (250). The next column is Valid Percent, i.e.,
percentages adjusted for missing values. Since there are no missing values in our data set, all
our cases are valid, and the percentages and valid percentages are the same. The next column is
Cumulative Percent, which is the running total of the valid percentages. If you look at this
column, you will see the percentages being added up: adding .4 to .4 gives .8, adding .4 to .8
gives 1.2, and adding .4 to 1.2 gives 1.6. So, that's how we use and interpret the Frequencies
option.
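The four columns of the frequency table (Frequency, Percent, Valid Percent and Cumulative Percent) can be reproduced by hand. The following Python sketch shows one way to do it under stated assumptions: the function name `frequency_table` is hypothetical, `None` stands in for a missing value, and the ten ages are invented, not taken from the diabetes_costs file.

```python
from collections import Counter


def frequency_table(values):
    """Return rows of (value, frequency, percent, valid percent, cumulative percent).

    `None` marks a missing value: it counts toward the total used for Percent
    but is excluded from Valid Percent and Cumulative Percent.
    """
    total = len(values)
    valid = [v for v in values if v is not None]
    counts = Counter(valid)
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        freq = counts[value]
        percent = 100.0 * freq / total
        valid_percent = 100.0 * freq / len(valid)
        cumulative += valid_percent  # running total of the valid percentages
        rows.append((value, freq, percent, valid_percent, cumulative))
    return rows


# Hypothetical ages with no missing values
ages = [13, 14, 15, 25, 25, 42, 42, 51, 51, 51]
for row in frequency_table(ages):
    print("%5s %4d %6.1f %6.1f %6.1f" % row)
```

Because this toy sample has no missing values, Percent and Valid Percent coincide, and the last cumulative percentage reaches 100, just as in the age table discussed above.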
Now, let's use a categorical variable and see how frequencies can be even more useful in that
case. We will use the ‘bankloan.sav’ data set (Figure 5) from the SPSS installation folder. In
this data set, we have 2 categorical variables: level of education (people in 5 educational
groups) and defaulters vs. non-defaulters. The variable ‘default’ is defined as No and Yes in
Values.
Similarly, you can interpret the Frequency Table for the level of education (Figure 7). There
are 5 levels of education, from 1 (those who did not finish high school) up to 5 (those who
have a postgraduate degree). In this data set, those who did not finish high school form the
largest group, i.e., 460 individuals accounting for 54.1% of the 850 individuals. Out of these
850 individuals, 27.6% belong to education level 2 (high school degree), 11.9% have some
college experience (level 3), 5.8% have a college degree (level 4) and only .6%, i.e., only 5
individuals, have a postgraduate degree (level 5).
You need to report this when you describe the overall descriptive profile of your study sample
in a journal article or in a report for an organization. Here again, the table gives valid
percentages and cumulative percentages. So, that's how we calculate and understand frequencies
and percentages in SPSS. Note that in this case, group-wise frequencies are not given. It means
that you cannot find out by this procedure what percentage of non-defaulters vs. defaulters
have a college degree or did not finish high school. The Frequencies option has no facility for
that. Instead, you can use the Crosstabs option. In that case, our research question is: What
percentage of individuals at each education level are defaulters, as compared to
non-defaulters? So, we want to break down all the educational groups by whether they defaulted
or not.
better organized data, instead of taking defaulter versus non defaulter for Rows.
It means we prefer the larger number of categories in rows and the smaller number of
categories in columns.
You can put the smaller number of categories in rows if that is meaningful for your research.
For instance, for a hypothetical variable gender, our research question could be: What
percentage of females have a college degree and what percentage of females have only passed
high school? For that, we need to click the Cells option in the Crosstabs dialog box. In the
Crosstabs: Cell Display dialog box, check Row and Column under Percentages, click Continue and
then click OK in the Crosstabs dialog box.
Now, a more detailed Crosstabs table has been generated in the output. In the crosstabulation
table, there are two types of percentages: row-wise percentages (% within Level of education)
and column-wise percentages (% within Previously defaulted).
Now, let us consider category 4 (College degree). Here, there are 38 individuals, of whom 24
did not default on the loan and 14 did. The second row is ‘% within Level of education’. Here,
out of 100% (38) individuals, 63.2% did not default on their loan while 36.8% defaulted. The
last row, ‘% within Previously defaulted’, is calculated on the basis of defaulters and
non-defaulters: 4.6% of all non-defaulters and 7.7% of all defaulters fall in category 4
(College degree). Together, these 38 individuals account for 5.4% of the 700 valid cases. Keep
in mind that the column-wise percentages are based on the column totals, while the row-wise
percentages are based on the total of a particular row only. So, that's how we calculate and
interpret frequencies in SPSS. Later on, you can create graphs, pie charts, etc. as well, and
that will be discussed in further chapters.
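The difference between row-wise and column-wise percentages is easy to see if you compute them yourself. The Python sketch below is an illustration only: the function name `crosstab_percentages` is hypothetical, and while the 24 and 14 for ‘College degree’ echo the counts quoted above, the counts for the ‘Other’ group are invented to keep the table small, so the column percentages will not match the SPSS output.

```python
from collections import Counter


def crosstab_percentages(pairs):
    """Cross-tabulate (row_category, column_category) pairs.

    Returns a dict keyed by (row, column) holding the cell count, the
    percentage within the row and the percentage within the column.
    """
    pairs = list(pairs)
    cell_counts = Counter(pairs)
    row_totals = Counter(row for row, _ in pairs)
    col_totals = Counter(col for _, col in pairs)
    table = {}
    for (row, col), n in cell_counts.items():
        table[(row, col)] = {
            "count": n,
            "row_pct": 100.0 * n / row_totals[row],  # base: the row total
            "col_pct": 100.0 * n / col_totals[col],  # base: the column total
        }
    return table


# 24 / 14 follow the College degree counts in the text; "Other" is invented
records = (
    [("College degree", "No")] * 24
    + [("College degree", "Yes")] * 14
    + [("Other", "No")] * 76
    + [("Other", "Yes")] * 36
)

table = crosstab_percentages(records)
cell = table[("College degree", "Yes")]
print(cell["count"], round(cell["row_pct"], 1), round(cell["col_pct"], 1))
```

The row percentage for the defaulting college graduates is 14 out of 38, i.e., 36.8%, the same figure as in the output discussed above, because it depends only on that one row. The column percentage depends on the totals of the whole column, which is why it changes when the other rows change.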
In this chapter, we will learn how to calculate some other measures of descriptive statistics.
Go to Analyze > Descriptive Statistics > Frequencies. Now, we will explore some more options.
In the Frequencies dialog box, click the Statistics button, where you will find four types of
descriptive stats (Percentile Values, Central Tendency, Dispersion and Distribution) that were
discussed in previous chapters.
If your sample has a very high standard deviation, then it is preferable to report the median
rather than the mean. Similarly, if your data are non-normal, you are supposed to report the
median, because it is the value that divides your entire series of observations into two equal
parts: 50% of subjects score below the median and 50% score above it. In that sense, the median
also communicates the location of the scores. It is a location below which 50% of values lie
and above which 50% of values lie, so the median is also a type of percentile and, as you can
easily guess, it is the 50th percentile. In any data set, the 50th percentile and the median
are always the same. Generally, we report the mean, but if there is very high dispersion in
your data or your data are non-normal, report the median instead. If you do a nonparametric
analysis like the Kruskal-Wallis test, the Mann-Whitney test or any other such test, it is the
convention to report the median instead of the mean, because you generally apply a
nonparametric test when the normality assumption is violated.
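The claim that the median is simply the 50th percentile can be checked with a short Python sketch. The ages are hypothetical, and the percentile rule used here is the default of Python's `statistics.quantiles`, which is not necessarily the rule SPSS applies to other percentiles; for the 50th percentile the result matches the median in both the odd-sized and even-sized examples.

```python
from statistics import median, quantiles

# Hypothetical ages: one sample with an odd count, one with an even count
ages_odd = [21, 30, 34, 41, 45, 47, 52, 60, 68]
ages_even = [21, 30, 34, 41, 45, 47, 52, 60, 68, 70]

for ages in (ages_odd, ages_even):
    # quantiles(..., n=100) returns the 99 cut points P1..P99; index 49 is P50
    p50 = quantiles(ages, n=100)[49]
    print(median(ages), p50)
```

In both samples, the two printed numbers agree: half of the observations lie below that value and half above it.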
Mode is the value that occurs most frequently in a data set. You can also find the mode by
looking at the frequency table, but if you check the Mode option in the Statistics dialog box,
SPSS will report a single value, the one that occurs most often in that data set. If you click
on the Sum option, it will give you the sum total of all observations for a variable. So, if
you select Sum for age or income, it will give the sum total of all the observations for age
and for income. Now, the sum total of all the observations for age might be meaningless,
because you are not looking for the total of all the subjects' ages, but calculating Sum for
income might be instructive. For example, I may have two data sets and want to know the total
income in each. So, the Sum option might give some insight into the data. Yet it is used much
less often, because Sum is not as helpful as the mean, median and mode.
In the diabetes_costs data set, let us calculate the average age and average income of the
subjects, the median age and median income, and the most frequent age and income, i.e., the
mode of the age and income variables.
(Figure 1). The ‘Display frequency tables’ checkbox will help in comparing the mode with the
most frequent observation and they will be the same.
individuals are lying above 45 years of age. So, 45 is the number that divides the entire
series of observations into two equal parts: 50% above it and 50% below it. If the
distribution of age is non-normal, reporting the median would be more instructive than
reporting the mean. The mode for Age is 51 and for Household Income, the mode is $22,600.
What exactly are the mean, median and mode? They are measures of central tendency, which tell
us where most of the scores are concentrated. Considering the age variable, a measure of
central tendency is supposed to tell us: What is the most typical age of the subjects in this
sample of 250 individuals? If you consider the mean as the measure of central tendency, it is
44 years. If you consider the median, it is 45 years, and if you consider the mode, it is 51
years. The large gap between the mean (44) or median (45) and the mode (51) shows that the mode
can be misleading. The mode should be used only when we understand it sufficiently. In this
case, using the mode is not recommended, because 51 does not describe the typical age in this
sample of 250 individuals. Let's take another instance: a shoe shop where you have to report
the most frequently bought shoe size. Our experience suggests that it is generally size 8 or
size 9 for males. In that case, the mode might be a very useful statistic, and not the mean or
median. For the current case, the mode is not a very meaningful statistic. You can use the
simple heuristic given in Figure 4 to remember which measure of central tendency to choose on
the basis of the nature of your data.
This gives you an idea of the situations in which to quote the mean and those in which to
quote the median or mode. Sum is the sum total of all the ages. For age, the sum is 10,999. On
its own, a sum of ages is quite meaningless as it gives us no useful information about the
subjects; it is shown here only for the sake of understanding.
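The relationship between these statistics can be tried out on a small sample. The Python sketch below uses nine hypothetical ages, chosen only so that the mean, median and mode come out as 44, 45 and 51 like the values reported above; they are not the actual diabetes_costs data. The last line checks the one link between the reported sum and mean: the sum divided by the number of cases.

```python
from statistics import mean, median, mode

# Nine hypothetical ages, picked so the results echo the chapter's 44 / 45 / 51
ages = [23, 31, 38, 44, 45, 46, 51, 51, 67]

print("mean  :", mean(ages))    # arithmetic average
print("median:", median(ages))  # middle value of the sorted ages
print("mode  :", mode(ages))    # most frequent value
print("sum   :", sum(ages))     # total of all ages

# The sum reported for the real sample, divided by its 250 cases
print(round(10999 / 250, 3))
```

The reported sum of 10,999 over 250 subjects gives 43.996, i.e., the mean of about 44 years. That is the only sense in which the sum of ages is informative here.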
If you look at the frequency table, this will again be confirmed. First, look at the frequency
table for age. The mode for age is 51, so we expect age 51 to be repeated the maximum number of
times in the table ‘Age in years’ under Frequency Table in the output. You will find that there
are 14 individuals of age 51 and the age with the second highest number of individuals is 44
(12 individuals). In the case of a continuous variable, it is preferable to use the mean or
median (and not the mode), depending upon the normality of the data. For Household Income, the
mode is $22,600. Check the Frequency table for Household Income for the observation ‘$22,600’
and you will see that it, like all other incomes, appears only once in this entire series. So,
the software has simply picked the smallest value as the mode, i.e., $22,600, for which
Frequency is 1 and Percent is .4. That confirms our understanding that the mode is generally
misleading in the case of continuous data. So, that's how we calculate the mean, median and
mode and report them.
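Why the mode breaks down for continuous data can be shown with a tiny Python sketch. The incomes are hypothetical and all different, so every value is tied as "most frequent". Taking the smallest of the tied values mirrors the behaviour described above for the Household Income variable; treating that as a general rule is an assumption here.

```python
from statistics import multimode

# Hypothetical household incomes (in thousands); every value occurs once
incomes = [58.2, 22.6, 73.9, 31.0, 45.5]

modes = multimode(incomes)   # all values tied for the highest frequency
smallest_mode = min(modes)   # pick the smallest of the tied values

print(len(modes), smallest_mode)
```

All five incomes are returned as modes, and the smallest one is reported. It tells you nothing about where the incomes are concentrated, which is exactly why the mean or median is preferred for a continuous variable.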
How do we report the mean, median and mode for grouped data? Suppose we want to report the
averages separately for the groups of a categorical variable, for example males and females.
Let us take the example of the car sales data set from the IBM SPSS Samples folder. You can
find this sample through the following path:
C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\car_sales.sav. In this data set, we
have many categorical as well as continuous variables, and the first question is how many
vehicle types we have. If you click on this variable, you will see that there are two vehicle
types in this data set: Automobile and Truck. Apart from that, the variables given are sales,
resale price and engine horsepower for the different vehicle types, along with manufacturer,
model, etc. Currently, we are interested in finding out whether the average sales price and
resale price of automobiles are higher than those of trucks.
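Reporting descriptives "by group" simply means splitting the cases on the categorical variable and computing the statistics within each group. The Python sketch below illustrates the idea under stated assumptions: the function name `grouped_summary`, the field names `type` and `sales`, and all the numbers are hypothetical, not the contents of car_sales.sav.

```python
from collections import defaultdict
from statistics import mean, median


def grouped_summary(records, group_key, value_key):
    """Compute n, mean and median of `value_key` within each `group_key`."""
    groups = defaultdict(list)
    for record in records:
        value = record.get(value_key)
        if value is not None:  # cases with a missing value are left out
            groups[record[group_key]].append(value)
    return {
        group: {"n": len(vals), "mean": mean(vals), "median": median(vals)}
        for group, vals in groups.items()
    }


# Hypothetical vehicles; None marks a missing sales figure
vehicles = [
    {"type": "Automobile", "sales": 40.0},
    {"type": "Automobile", "sales": 60.0},
    {"type": "Automobile", "sales": None},
    {"type": "Truck", "sales": 30.0},
    {"type": "Truck", "sales": 90.0},
    {"type": "Truck", "sales": 150.0},
]

summary = grouped_summary(vehicles, "type", "sales")
for group, stats in summary.items():
    print(group, stats)
```

Each group gets its own count of valid cases, mean and median, which is the same kind of arrangement you get when a categorical variable is used to split the descriptives in SPSS.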
Take the variables Vehicle Type, Model and Manufacturer and compare whether
BMW or Audi cars have better sales and resale prices.
that gives the 95% confidence interval of the mean. Roughly speaking, it tells us that if you
repeated the analysis on other samples of the same type drawn from the same population and
built an interval each time, about 95% of those intervals would contain the population mean.
So, keep this option checked (you can define this value depending upon your requirements). For
example, if you want a range that captures the average resale value with 99% confidence, you
can define it as a 99% confidence interval. Currently, we will keep 95% because that is the
default value and is generally accepted as the standard in statistical analysis. Click
Continue. We are not checking any other option right now; we are saving them for later
chapters. Under Display, there are three options: Statistics, Plots, or Both. The Explore
option also gives you certain plots, namely normality plots and descriptive plots. Currently,
we do not need plots, so you could simply select Statistics, but just to get a feel for what it
offers, select Both. Then click OK.
In the output for grouped descriptives (derived from the previous chapter), we get a different
kind of arrangement. The first table is the Case Processing Summary, which tells us that we
have two variables in our analysis: Sales in thousands and 4-year resale value. This table
gives you the details of each group. For Sales in thousands, there are 91 observations in the
Automobile group and 30 observations in the Truck group. Similarly, the observations for the
resale variable are 91 Automobiles and 30 Trucks. The percentages of valid cases have also been
calculated for each group: about 78% for Automobiles and 73% for Trucks. The percentage of
missing cases, for both sales and resale, is 21.6% for Automobiles and 26.8% for Trucks.
Basically, the number of observations in each group plus the missing cases is equal to the
Total, or 100%.
Isn’t it? A major reason for this is that standard deviation is quite good for this.
Inter-quartile range is nothing but the difference between P75 and P25. If you subtract the
score below which 25% of cases lie from the score below which 75% of cases lie, that gives you
the inter-quartile range. In this case, the inter-quartile range is 53.23. Inter-quartile range
also tells us about the spread of scores: what is the gap between P75 and P25? Let's compare
the inter-quartile range for sales of automobiles versus trucks. The inter-quartile range is
53.23 for Automobiles and 108.641 for Trucks, about double! So, the variation in sales of
trucks is much higher than the variation in sales of automobiles, and that can be confirmed by
looking at the standard deviation. The standard deviation for sales is 52.67 for Automobiles
and 111.17 for Trucks. Again, the value for automobiles is around half of that for trucks.
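The subtraction P75 minus P25 can be sketched in Python as follows. The seven data points are hypothetical, and the quartile rule is the default of `statistics.quantiles`; SPSS may interpolate quartiles differently, so this shows the idea rather than reproducing its output. The last line compares the two inter-quartile ranges quoted above.

```python
from statistics import quantiles


def interquartile_range(values):
    """Return P75 - P25 using the default rule of statistics.quantiles."""
    q1, _q2, q3 = quantiles(values, n=4)  # the three quartile cut points
    return q3 - q1


# Hypothetical sales figures
sales = [1, 2, 3, 4, 5, 6, 7]
print(interquartile_range(sales))

# Ratio of the inter-quartile ranges reported for Trucks and Automobiles
print(round(108.641 / 53.23, 2))
```

For the toy data, P25 is 2 and P75 is 6, so the inter-quartile range is 4. The ratio of the reported values comes to about 2.04, which is the "about double" noted above.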
What is the relevance of significant skewness or significant kurtosis? Let us understand this
with an example. Suppose we are comparing different types of cars, and one company is beating
the drum about selling a significantly large number of cars at higher than the average price.
Suppose also that there is competition between different branches of that same company. Outlet
A claims that it is significantly better than Outlet B because its salesmen are selling cars at
prices significantly higher than the average. So, you can first check whether Outlet A is
indeed selling at significantly higher than the average value and then make a comparison
between these two groups. For that, we use the significance value of skewness and kurtosis. Let
us apply this in the current case and assess whether the skewness for car sales is significant.
Using your calculator, divide the skewness value by its standard error: 2.567 divided by 0.427
gives approximately 6.01. This is higher than all three conventional cutoffs (1.96, 2.58 and
3.29), so there is significant positive skewness in car sales. The variable ‘sales’ is thus a
very significantly positively skewed variable: a few vehicles have sales far above the average,
which stretches the right tail of the distribution, and it is the same scenario for trucks and
automobiles.
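The calculator step above (skewness divided by its standard error, then compared with cutoffs) can be written as a short Python sketch. The skewness and standard error are the values quoted in the text. The cutoffs 1.96, 2.58 and 3.29 used here are the conventional two-tailed critical values of the standard normal distribution for p < .05, p < .01 and p < .001; using them for this check is an assumption of this sketch.

```python
def skewness_z(skewness, std_error):
    """Return the skewness statistic divided by its standard error."""
    return skewness / std_error


# Conventional two-tailed z cutoffs (an assumption of this sketch)
CUTOFFS = {"p < .05": 1.96, "p < .01": 2.58, "p < .001": 3.29}

z = skewness_z(2.567, 0.427)  # values quoted in the text
exceeded = [label for label, cutoff in CUTOFFS.items() if abs(z) > cutoff]

print(round(z, 2), exceeded)
```

The ratio is about 6.01, which exceeds all three cutoffs, and its positive sign indicates that the skew is to the right.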
Standard error is usually shown wherever you see the mean in SPSS. Standard error tells you how
much error you are likely to commit when you estimate the population mean from a sample drawn
from that population. Since our calculation is based entirely on a sample, we cannot be sure
whether this sample mean truly represents the population mean. The larger the typical deviation
between sample mean and population mean, the higher the error. We need a measure for this, and
we call that measure the standard error. Standard error describes the extent to which the
sample mean is expected to deviate from the population mean.
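The standard error of the mean is commonly computed as the sample standard deviation divided by the square root of the sample size, and a confidence interval for the mean is built from it. The Python sketch below illustrates both under stated assumptions: the resale values are hypothetical, and the interval uses the normal (z) approximation because the standard library has no t distribution. A t-based interval would be somewhat wider for a sample this small.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def standard_error(sample):
    """Standard error of the mean: sample SD divided by sqrt(n)."""
    return stdev(sample) / sqrt(len(sample))


def approx_confidence_interval(sample, level=0.95):
    """Normal-approximation confidence interval for the mean (a sketch)."""
    z = NormalDist().inv_cdf(0.5 + level / 2)  # about 1.96 for 95%
    m = mean(sample)
    se = standard_error(sample)
    return m - z * se, m + z * se


# Hypothetical resale values (in thousands)
resale = [11.2, 14.8, 16.4, 18.1, 19.9, 21.5, 23.0, 27.3]

print(round(standard_error(resale), 3))
print(approx_confidence_interval(resale, 0.95))
print(approx_confidence_interval(resale, 0.99))
```

A larger sample or a smaller standard deviation gives a smaller standard error and therefore a narrower interval, while asking for 99% instead of 95% confidence widens it.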