
SPSS For Uninitiated:


A Visual Odyssey for Mortals

"The road to every heaven goes through a hell. Bear this in mind."
- Swami Vivekanand, Complete Works

Sanjay Singh

© Sanjay Singh.
Email: sanjay.singh3210@gmail.com

Images and screenshots of IBM SPSS Statistics software are used in this book for
learning purposes only. The source/credit of any other illustration, image or
resource is duly acknowledged wherever applicable. This book is a work in
progress and I will keep it continuously updated. Please do not feel offended if
you find typos or errors anywhere, and kindly do not rush to assess the IQ and
EQ of the author based on some unintentional mistakes that he, as a mortal, may
commit. I am yet to organize content for many chapters and proofreading the
book is a distant dream. This book is like an experimental release, and it will
take time to give it a final shape. Kindly do not redistribute the content in an
unauthorized manner. Any positive comments and suggestions to improve the book
are welcome.

2020 © Sanjay Singh



Chapter Descriptive Statistics in SPSS

37. SETTING DATA FOR DESCRIPTIVE ANALYSIS

This chapter focuses on understanding various kinds of descriptive statistics using SPSS. SPSS
is very efficient software for descriptive analysis, apart from inferential analysis, in your
research work. There are several ways in which you can calculate descriptive statistics in SPSS.

First, let us import a data set in SPSS to demonstrate how to use descriptive analysis. To import a
new data set, open the 'diabetes_costs.sav' file from the SPSS samples folder by following these
steps - click on the Folder icon, locate your C drive > Program Files (use the Program Files
folder on a 64-bit operating system; on a 32-bit operating system use the Program Files (x86)
folder) > IBM > SPSS > Statistics > 23 > Samples > English > 'diabetes_costs.sav' > Open.
We will use this data set to demonstrate the calculation of various kinds of descriptive
statistics (Figure 1). The data set contains the age of each subject, glucose level, income and
treatment cost. Suppose it is studying the treatment cost of subjects based on their age, glucose
level and income. Descriptive statistics can be useful in communicating an overall picture of
your data set. For example, you might wish to report the average age of subjects in this data set
or the percentage of subjects below a particular age. If you want to communicate what
percentage of subjects in the data set is below a certain age (in years), you will use percentiles.
If you need to communicate the average age of subjects, you will report the mean. Then there
are descriptive statistics such as the median, mode, etc., and we will learn to understand and
calculate them in a stepwise manner. The variables of study here are age, glucose level, income
and treatment cost, so we might be interested in calculating the average income of the subjects
in our data set and the average treatment cost.
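
If you prefer syntax to menus, the same file can be opened with a GET FILE command. This is only a sketch: the path below assumes a default 64-bit installation of SPSS Statistics version 23, so adjust it to match your own installation.

    * Open the sample data set shipped with SPSS (adjust the path to your installation).
    GET FILE='C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\diabetes_costs.sav'.
    DATASET NAME diabetes WINDOW=FRONT.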


Figure 1 : diabetes_cost dataset

38. TYPES OF DESCRIPTIVE STATISTICS

To calculate the mean and other descriptive statistics, go to Analyze > Descriptive Statistics.
All the descriptive statistics procedures are listed there and you can select whichever suits
your purpose. The first option is Frequencies (Figure 2), which is the most basic type of
descriptive statistic. It simply tells you the number of times a particular observation occurs in
the data. For example, you might be interested in knowing how many individuals in the
'diabetes_costs.sav' data set are aged 25 years or 46 years. Is there any repetition? There might
be two, three or more persons whose age is 25 years or 46 years. To find that out, you can
calculate the frequencies.

Figure 2 : Frequencies option in SPSS

To do that, click on Frequencies and you will see a dialog box (Figure 3) with a checkbox below
that is checked automatically and reads 'Display frequency tables'. It simply means that if you
keep this checked, you will see a frequency table in the output.

Figure 3 : Frequencies dialog box


Apart from this, there are other options such as Statistics. When you click it, another dialog box
opens with four types of descriptive statistics - Percentile Values, Central Tendency, Dispersion
and Distribution - each further divided into categories (Figure 4). Measures of dispersion
comprise the standard deviation, minimum, maximum, variance, range and standard error of the
mean (S.E. mean). Measures of distribution comprise skewness and kurtosis. Measures of
location, generally known as measures of central tendency (mean, median, mode and sum),
basically tell us where a score is located in a series of observations.

Figure 4 : Descriptive Statistics in Frequencies option

As opposed to a measure of central tendency, we might be interested in assessing the extent to
which scores are spread away from the mean or the central score. In this case you can use a
measure of dispersion such as the standard deviation. For instance, suppose that in an office the
salary of person A is highly distinct (very low or very high) compared to the salaries of all other
persons working in that office. In that case a measure of dispersion is the appropriate statistic,
and you can calculate the standard deviation of the salary scores in this office. If the standard
deviation is high, it might be that many individuals in this office are earning a great deal and
many are earning very little, implying a huge spread of scores. Of course, where there are such
highs and lows the minimum and maximum are also informative: there will be a large difference
between the minimum and the maximum score, spreading the entire set of scores over a wide
range, so the range will be high. In short, measures of dispersion tell us to what extent scores are
dispersed or spread from each other.
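
As a quick sketch of how these measures can be requested in one go through syntax, the Descriptives procedure accepts them as keywords. The variable names age and income below are assumptions - replace them with the names shown in your Variable View.

    * Request common measures of dispersion for two scale variables.
    DESCRIPTIVES VARIABLES=age income
      /STATISTICS=MEAN SEMEAN STDDEV VARIANCE RANGE MIN MAX.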

Distribution concerns the shape of the scores: it gives an informative look at the distribution of
the study variables. There are two measures of distribution listed here - skewness and kurtosis.
Next, one commonly used family of descriptive statistics is the percentile values. A percentile
tells you what percentage of people in a given set of observations lie below a particular score.
For example, if 90% of the workers in an office earn less than USD 10,000, then the 90th
percentile is USD 10,000; this conveys the location of an employee in the series. In many
competitive examinations in India, such as the Common Admission Test (CAT) for MBA
admissions, you get percentile scores. There, the percentile basically tells you what percentage of
candidates scored below a particular candidate or a particular score. Quartiles are nothing but a
type of percentile: if you divide the whole range of percentiles (which run from 0 to 100) into
four parts, you will get four quarters, also known as quartiles. These, then, are the types of
descriptive statistics available in SPSS, and we will learn how to use and calculate each of them
in the following chapters.
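
To make the percentile idea concrete, the Frequencies procedure will report any percentile you ask for. This is only an illustrative sketch: the variable name salary is hypothetical and stands in for the office example above.

    * Quartiles and the 90th percentile of a hypothetical salary variable.
    FREQUENCIES VARIABLES=salary
      /FORMAT=NOTABLE
      /PERCENTILES=25 50 75 90.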


39. UNDERSTANDING THREE DIFFERENT DESCRIPTIVE TABS IN SPSS

A number of descriptive statistics options are repeated in various locations in the software, and
each variation serves a slightly different purpose.

In this case, since we have only scale data and no nominal or categorical data (Figure 5), we are
not currently interested in studying whether, say, males have a higher average age, glucose level
or income than females. But if your data also contained a variable such as gender, you might be
interested in reporting the average age for males and females separately. The Frequencies option
provides no facility to calculate the average age of different nominal categories, so it is not
sufficient here. Instead, you can take the help of another descriptive option.

Figure 5 : Scale Variables

Go to Analyze > Descriptive Statistics > Explore (Figure 6). The Explore option gives you the
facility to partition the descriptive statistics according to a factor list.

Figure 6 : Explore option


Let us open a different data set here. Open the SPSS samples folder by going through this path -
Program Files > IBM > SPSS > Statistics > 23 > Samples > English > bankloan.sav > Open. In
this data set the variables are age, level of education, and so on. Suppose we want to assess which
group has a higher average income compared to the others. The education level has been assigned
values 1 to 5, with 1 signifying 'Did not complete high school' and 5 signifying
'Post-undergraduate degree' (Figure 7). The idea is that education brings a better salary, and we
want to find out the average salary of postgraduates compared to the average salary of the rest of
the groups.

Figure 7 : Values for educational level variable

The research question here is - does average salary increase with increasing qualification? To
answer this question, the Frequencies option cannot partition a continuous variable by a
categorical variable (e.g. level of education or gender). Instead, the Explore option is useful for
this. Click Explore and you will see a dialog box as in Figure 8. The different variables are listed
on the left side and, on the right, there are two boxes - the Dependent List and the Factor List.
Let us try to find out the average salary of the subjects.

Figure 8 : Explore dialog box

Choose the option 'Household income in thousands [income]' and shift it to the Dependent List.
Shift 'Level of education [ed]' to the Factor List (Figure 9). This is going to provide the
descriptive statistics for each educational level category - level one to level five.

Figure 9 : Shifting variables for which descriptives are needed
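
For reference, this Explore setup corresponds roughly to the EXAMINE syntax that the Paste button generates, assuming the bankloan.sav variables are named income and ed as shown in the dialog.

    * Descriptive statistics for income, split by level of education.
    EXAMINE VARIABLES=income BY ed
      /PLOT NONE
      /STATISTICS DESCRIPTIVES
      /CINTERVAL 95
      /MISSING LISTWISE
      /NOTOTAL.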

I hope you have realized the utility of having different descriptive statistics tabs in SPSS. They
have not just been repeated at random; they are useful for different purposes, and if you keep
practicing with SPSS you will gradually realize the subtle differences between them. Certain
options, however, are repeated with the same purpose.

For example, if you go to Analyze > Descriptive Statistics > Frequencies > Statistics… (Figure
10), you will see options such as mean, median, mode and sum. If you go to Analyze >
Descriptive Statistics > Descriptives > Options… (Figure 11), you will again find the mean and
most of the other options seen in the Statistics… dialog box.

Figure 10 : Frequencies: Statistics dialog box

Figure 11 : Descriptives: Options dialog box

Figure 12 : Descriptives option

Some options are repeated but their purpose is distinct; gradually we will learn their purposes and
how to use them. In the Descriptives dialog, the option 'Save standardized values as variables' is
the only facility you can use for the calculation of z-scores; z-scores cannot be calculated using
the Frequencies dialog. We will learn these options gradually, and over a period of time you will
get used to using them.
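
A minimal syntax sketch of that z-score facility: the /SAVE subcommand of DESCRIPTIVES adds standardized versions of the listed variables (named Zage, Zincome and so on) to the active data set. The variable names age and income are assumptions.

    * Save standardized values (z-scores) as new variables.
    DESCRIPTIVES VARIABLES=age income
      /SAVE.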


40. CALCULATING FREQUENCIES

Let us learn how to use the Frequencies option. To open it, go to Analyze > Descriptive
Statistics > Frequencies (Figure 13).

Figure 13 : Frequencies option in SPSS

In the 'diabetes_costs.sav' data set we only have continuous variables and no categorical
variables, so finding the frequency of the various ages may not be very meaningful; still, you can
use the Frequencies dialog to do it (Figure 14). By default, 'Display frequency tables' is checked.

Figure 14 : Frequencies dialog box

Select the Age variable (Figure 15) and let us see whether people of different ages are repeated
in this data set. The research question is - is there more than one person belonging to each of the
various ages given in the data set (i.e. 25, 46, 27 and so on)? Leave the 'Display frequency
tables' option checked. Click OK and the output tables are generated.

Figure 15 : Age variable selected

Figure 16 : Frequencies output

Now, in the frequencies output, a total of 250 valid cases have been processed, which is the same
as the sample size (Figure 16). So all the cases have been processed and there is no missing data.
The frequencies of the various ages are given in the output table. There is one person of 13 years
of age, one person of 14 years, one of 15 years and so on. Our question was - how many persons
are 25 years of age? There are two persons of 25 years, and there are certain ages which are
repeated in this data. The most common age in this data set is 51 years: there are 14 individuals
of 51 years, and after age 51 there are 11 individuals of 42 years. So frequencies give you a
rough idea of which observations are repeated in your data set and how many times. Needless to
say, this repetition is what will later give us the mode of this data set. The Percent column in the
frequency table refers to percentages of the total; the first age (13) is .4% of the total (250). The
next column is Valid Percent, i.e. percentages adjusted for missing values. Since there are no
missing values in our data set, all our cases are valid and the percentages and valid percentages
are the same. The next column is Cumulative Percent, which is the cumulative addition of the
valid percentages. If you look at this column, you will see the percentages being added up:
adding .4 to .4 gives .8, adding .4 to .8 gives 1.2, adding .4 to 1.2 gives 1.6, and so on. That is
how we use and interpret the Frequencies option.

Figure 17 : bankloan Data Set

Now let us use a categorical variable and see how frequencies can be even more useful in that
case. We will use the 'bankloan.sav' data set (Figure 17) from the SPSS samples folder. In this
data set we have two categorical variables - level of education (people of five educational
groups) and defaulters vs. non-defaulters. The variable 'default' is defined as No and Yes in
Values.


So the research question is - what percentage of individuals in this data set are in the defaulter
category, compared to non-defaulters? And suppose we also want to find out what percentage
belong to the different educational groups. For both questions you can again use Frequencies. Go
to Analyze > Descriptive Statistics > Frequencies. The first research question is - what is the
percentage of defaulters in this data set? We have the same question for the education variable as
well. In the Frequencies dialog box (Figure 18), shift these two variables to the Variable(s) box
and click OK.

Figure 18 : Adding categorical variables in Frequencies dialog box
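
The equivalent syntax, assuming the two categorical variables are named default and ed as in the sample file, is roughly the following; the optional bar chart subcommand gives a quick visual of the counts.

    FREQUENCIES VARIABLES=default ed
      /BARCHART FREQ
      /ORDER=ANALYSIS.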


In the output for the bankloan data set (Figure 19), we have 700 valid cases and 150 missing
cases, as seen in the Statistics table. That is a huge number of missing cases. Level of education
has been reported for all 850 cases - there are no missing values for level of education - but the
'Previously defaulted' value is not available for 150 individuals. In the data set, from case 701
onwards there is no data in the 'default' variable, which is why the last 150 cases are non-valid or
missing cases. Keep that in mind for interpretation purposes.

Figure 19 : Frequencies Output for bankloan data set

The next table in the output is the Frequency Table. For the first variable, i.e. default, 0 means
did not default (No) and 1 means defaulted (Yes). There are 517 individuals out of 700 who did
not default on their loan, while 183 individuals defaulted. The percentage of non-defaulters, that
is the 0 category, is 60.8% of our data set, and defaulters are 21.5%. Together they make a total
of 82.4%, and 17.6% are missing cases; in combination, the missing cases and all the rest make
up 100% of the observations. But if you go by the valid frequency, i.e. 700, and adjust your
outcome based on the valid cases only, it can be said that out of all the valid cases 73.9% are
non-defaulters and only 26.1% are defaulters. This gives us the picture that around a quarter of
the individuals who apply for a loan turn into defaulters, while the rest, around 73.9%, are
non-defaulters. That is how we get a descriptive picture of our data set. Then the cumulative
percentages are given: the first is 73.9% and the second is 73.9 + 26.1 = 100%.

Similarly, you can interpret the Frequency Table for the level of education (Figure 19). There are
five levels of education, beginning from 1 (those who did not finish high school) up to 5 (those
who have a post-undergraduate degree). In this data set, those who did not finish high school
form the largest group - 460 individuals, accounting for 54.1% of the 850 individuals. Of these
850 individuals, 27.6% belong to education level 2 (high school degree), 11.9% have some
college experience, 5.8% have a college degree and only .6%, i.e. 5 individuals, have a
post-undergraduate degree.

You need to report this overall descriptive scenario of your study sample when you write for any
journal or in a report for an organization. These are, again, valid percentages and cumulative
percentages, and that is how we calculate and understand frequencies and percentages in SPSS.
Note that in this case group-wise frequencies are not given. It means that you cannot find out by
this procedure what percentage of non-defaulters vs. defaulters have a college degree, or did not
finish high school - there is no facility here for doing that. Instead of using Frequencies, you can
use the Crosstabs option to find that. In that case our research question is - what percentage of
individuals have a high school degree, compared to other degrees, among defaulters compared to
non-defaulters? In other words, we want to break up all the educational groups based on whether
they defaulted or not.

41. DESCRIPTIVES ANALYSIS USING CROSSTABS

Crosstabs helps you create contingency tables in SPSS. To use the Crosstabs option, go to
Analyze > Descriptive Statistics > Crosstabs (Figure 20).

Figure 20 : Crosstabs option in SPSS

In the Crosstabs dialog box you can take one variable in rows and one variable in columns. Shift
defaulters versus non-defaulters to Column(s) and level of education to Row(s) (Figure 21). That
gives us better organized output than putting defaulter versus non-defaulter in the rows; in other
words, we prefer to keep the variable with fewer categories in the columns and the variable with
more categories in the rows.

Figure 21 : Crosstabs dialog box

You can arrange the variables the other way around if that is more meaningful for your research.
For instance, for a hypothetical variable gender, our research question could be - what percentage
of females have a college degree and what percentage of females are only high school pass-outs?
To get such percentages, we need to click the Cells button in the Crosstabs dialog box. In the
Crosstabs: Cell Display dialog box, check Row and Column under Percentages, click Continue
and then click OK in the Crosstabs dialog box.
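
For the record, the dialog choices above paste roughly the following CROSSTABS syntax; the variable names ed and default are assumed from the bankloan.sav file.

    * Contingency table of education (rows) by default status (columns),
    * with row and column percentages in each cell.
    CROSSTABS
      /TABLES=ed BY default
      /FORMAT=AVALUE TABLES
      /CELLS=COUNT ROW COLUMN
      /COUNT ROUND CELL.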

Figure 22 : Crosstabs output - Case Processing Summary

Now a more detailed crosstabulation has been generated in the output, after the Case Processing
Summary (Figure 22). It contains two types of percentages - row-wise percentages (within Level
of education) and column-wise percentages (within Previously defaulted).


Now let us interpret the crosstabulation (Figure 23). The research question was - what percentage
of defaulters belong to the high-school-degree category? Among those with a high school degree,
139 individuals did not default on the loan while 59 individuals defaulted. For category 1 (Did
not complete high school), 293 individuals did not default on the loan and 79 individuals did.

Figure 23 : Cross tabulation table

Figure 24 : Crosstabulation for individuals with college degree


Now let us consider category 4 (College degree). Here there are 38 individuals, with 24 who did
not default on the loan and 14 who did. The second row is '% within Level of education': out of
these 38 individuals (100%), 63.2% did not default on their loan while 36.8% defaulted. The last
row, '% within Previously defaulted', is calculated on the basis of defaulters and non-defaulters:
in category 4 (College degree), out of the total 700 individuals, 4.6% of the non-defaulters and
7.7% of the defaulters have a college degree, and together they account for 5.4% of the 700 valid
cases. Keep in mind that the column-wise percentages are based on the column totals, while the
row-wise percentages are based on the total within a particular row only. That is how we
calculate and interpret frequencies in SPSS. Later on you can also create graphs, pie charts, etc.,
and that will be discussed in further chapters.

42. MEASURE OF CENTRAL TENDENCY – MEAN, MEDIAN, MODE – CONCEPT


AND USES

In this chapter we will learn how to calculate some other measures of descriptive statistics. Go
to Analyze > Descriptive Statistics > Frequencies; now we will explore some more options. In
the Frequencies dialog box, click the Statistics button, where you will find the four types of
descriptive statistics (Percentile Values, Central Tendency, Dispersion and Distribution) that
were discussed in previous chapters.

Now let us try to understand and interpret the measures of central tendency - mean, median,
mode and sum. The mean is the average of your data. It is one of the most commonly reported
statistics and it is the norm to report the mean in journals. In the descriptive analysis part of a
project, or even when communicating in a daily-life situation, we usually talk about the mean
rather than the median or mode, but means are susceptible to extreme values. Mean values are
highly influenced by extreme scores, whether they are high or low.

Figure 25 : Frequencies: Statistics dialog box

If your sample has a very high standard deviation, then rather than reporting the mean it is
preferable to report the median. Similarly, if you have non-normal data you are supposed to
report the median, because it is a value that divides your entire series of observations into two
equal parts - 50% of subjects score below the median and 50% score above it. In that sense the
median also communicates the location of the scores: it is the point below which 50% of values
lie and above which 50% of values lie. The median is therefore also a type of percentile, and you
can easily guess that it is the 50th percentile; in any data set the 50th percentile and the median
are always the same. Generally we report the mean, but if there is very high dispersion in your
data or your data is non-normal, report the median instead. If you run a nonparametric analysis
such as the Kruskal-Wallis test or the Mann-Whitney test, the rule is to report medians instead of
means, because you generally apply a nonparametric test when the normality assumption is
violated; that is why you do not report the mean for a nonparametric test and report the median
instead.

The mode is the value that occurs most frequently in a data set. You can also find the mode by
looking at the frequency table, but if you check the Mode option in the Statistics dialog box,
SPSS will give only a single value - if several values are tied, it shows the smallest of them. If
you click the Sum option, it will give you the sum total of all observations for a variable. So if
you select Sum for age or income, it will give the sum of all the observations for age and income.
The sum of all the subjects' ages might be meaningless, because you are not looking for the total
age of all the subjects, but calculating the Sum for income might be instructive: for example, if I
have two data sets and I want to know the total income in each, the Sum option might give some
insight about the data. Still, it is rarely used, because you can fairly imagine that Sum is not very
helpful compared to the mean, median and mode.

43. CALCULATING AND INTERPRETING MEAN, MEDIAN & MODE

In the diabetes_costs data set, let us try to calculate the average age and average income of the
subjects, the median age and median income, and the most frequent age and income, i.e. the
mode of the age and income variables.

Go to Analyze > Descriptive Statistics > Frequencies. Shift the Age and Household income
variables to the Variable(s) list (Figure 26). The 'Display frequency tables' checkbox will help in
comparing the mode with the most frequent observation - they will be the same.

Figure 26 : Age and Income variables shifted to Variable(s) list

Click the Statistics button and, in the Frequencies: Statistics dialog box, select Mean, Median,
Mode and Sum under Central Tendency (Figure 27). Click Continue and then click OK.

Figure 27 : Selecting options in Frequencies: Statistics
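
The corresponding syntax, assuming the two variables are named age and income, is roughly:

    FREQUENCIES VARIABLES=age income
      /STATISTICS=MEAN MEDIAN MODE SUM
      /ORDER=ANALYSIS.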

In the Statistics table in the output (Figure 28), there are 250 valid cases for both Age and
Household Income, so there is no missing value in the data. The average age is 44 years and the
average household income is $44,969.86. The median for Age is slightly higher than the mean,
at 45 years. For income the median is a bit lower, i.e. $42,625. We can now interpret: the median
age of 45 years implies that 50 percent of individuals lie below 45 years of age and 50 percent lie
above it. So 45 is basically a number that divides the entire series of observations into two equal
parts - 50% above it and 50% below it. If the distribution of age is non-normal, reporting the
median is more instructive than reporting the mean. The mode for Age is 51 and for Household
Income the mode is $22,600.

Figure 28 : Statistics output for Mean, Median, Mode and Sum

What exactly are the mean, median and mode? They are measures of central tendency, which tell
us where most of the subjects are located. Considering the age variable, a measure of central
tendency is supposed to tell us the most typical age of all the subjects studied in this sample of
250 individuals; measures of central tendency simply convey where most of the scores are
concentrated. So what is the most typical age of this sample? If you consider the mean as the
measure of central tendency, it is 44 years. If you consider the median, it is 45 years, and if you
consider the mode, it is 51 years. The large difference between the mean (44) and median (45) on
the one hand and the mode (51) on the other shows that the mode can be misleading. The mode
should be used only when we have a sufficient understanding of it. In this case, using the mode
is not recommended: although 51 is the most frequently occurring age, it does not describe the
typical age of these 250 individuals well. Now take another instance - a shoe shop where you
have to report the most frequently bought or most commonly preferred shoe size. Experience
suggests that it is generally size 8 or 9 for males; in that case the mode might be a very useful
statistic, and not the mean or median. For the current case, however, the mode is not a very
meaningful statistic. You can remember the simple heuristic given in Figure 29 for choosing a
measure of central tendency on the basis of the nature of your data.

NATURE OF DATA                  MEASURE OF CENTRAL TENDENCY

Continuous and normal           Mean
Continuous and non-normal       Median
Data in frequencies             Mode

Figure 29 : Which central tendency to use for which type of data?

This gives you an idea of which situation calls for quoting the mean and which calls for the
median or mode. Sum is the total of all the ages; for age the sum is 10,999. It is quite
meaningless to report the sum of ages, as it does not give us any useful information, and it is
shown here only for the sake of understanding.


44. CONFIRMING MODE WITH FREQUENCIES

We will continue here with the central tendencies output for the 'diabetes_costs' data set from the
previous chapter. Let us look at the figures for Household Income: the mean household income is
$44,969.86, the median is $42,625 and the mode is $22,600. The mode is highly distinct
compared to the mean or median; as we will see below, this is essentially an artifact of the way
SPSS picks the mode when no single value clearly occurs most often.

Figure 30 : Statistics table mentioning central tendencies

Figure 31: Matching Frequency table with Mode

If you look at the frequency table, this will be confirmed again. First look at the frequency table
for age. The mode for age is 51, so we expect persons of age 51 to be repeated the maximum
number of times in the 'Age in years' table under Frequency Table in the output. You will find
that there are 14 individuals of age 51, and that the second most frequent age is 44 (12
individuals). In the case of a continuous variable it is preferable to use the mean or median (and
not the mode), depending upon the normality of the data. For Household Income the mode is
$22,600. Check the Frequency table for Household Income for the observation '$22,600' and you
will see that every income, including this one, appears only once in the entire series. Since no
value occurs more often than the others, the software has simply picked the smallest value as the
mode, i.e. $22,600, for which the frequency is 1 and the percent is .4. That confirms our
understanding that the mode is generally misleading in the case of continuous data. That is how
we calculate and report the mean, median and mode.

45. EXPLORE OPTION – CALCULATING GROUPED DESCRIPTIVES

How do we report the mean, median and mode for grouped data? Suppose we want to report the
averages for males and females separately, treating gender as categorical data. Let us take the
example of the car_sales data set from the IBM SPSS Samples folder. You can find this sample
at the following path - C:\Program Files\IBM\SPSS\Statistics\23\Samples\English\car_sales.sav.
In this data set we have a lot of categorical as well as continuous variables, and the first thing to
note is how many vehicle types we have: if you click on this variable, you will see there are two
vehicle types in this data set - Automobile and Truck. Apart from that, the variables given
include sales, resale price and engine horsepower for the different vehicle types, along with
manufacturer, model, etc. Currently we are interested in finding out whether the average sales
price and resale price of automobiles are higher compared to trucks.

For that purpose, go to Analyze > Descriptive Statistics > Explore. In the Explore dialog box we
now want to calculate the average sale and resale value of the different vehicle types. Shift the
variables 'Sales in thousands' and '4-year resale value' to the Dependent List and 'Vehicle type' to
the Factor List (Figure 32).

Figure 32 : Selecting variables for calculating grouped descriptives


This will provide us with the mean, median and other descriptive statistics for the two vehicle
types. But for this we need to select a few things under the Statistics button of the Explore dialog
box. Descriptives is selected by default (Figure 33) and this option also gives you the 95%
confidence interval of the mean. Reporting this interval is a rather neglected practice - in journals
as well as in business and other reports, people often do not report the 95% confidence interval
of the mean and instead just report the averages. Averages alone are not enough, as they only
describe the situation of the sample; they do not describe the likelihood of getting the same kind
of mean in the population. In this case, suppose we obtain some average sale or resale value for
these vehicle types: if these averages are likely to prevail in the population as well, what would
the range be? For that it is better to choose the option that gives the 95% confidence interval of
the mean, which tells us that if you repeated the analysis on another sample of the same type,
drawn from the same population, the mean would be likely to fall within a certain range in 95%
of cases. So keep this option checked. You can change this value depending upon your
requirements; for example, if you want to know the range within which the average resale value
would fall in 99% of cases, you can define it as a 99% confidence interval. Currently we will
keep 95%, because that is the default value and is generally accepted as the standard in statistical
analysis. Click Continue. We are not checking any other option right now; we are saving those
for later chapters. Under Display, the choices are Statistics, Plots, or Both. The Explore option
also gives you certain plots, namely the normality plots and descriptive plots. Currently we do
not need plots, so you could simply select Statistics, but just to get a feel of what it offers, click
Both. Then click OK.

Figure 33 : Descriptives option in Explore: Statistics

UP FOR A CHALLENGE?

Take the variables Vehicle type, Model and Manufacturer and compare whether BMW or Audi
cars have better sales and resale prices.
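
The whole configuration above corresponds roughly to the EXAMINE syntax below (what the Paste button would generate); the car_sales.sav variable names sales, resale and type are assumptions, so check them in your Variable View.

    * Grouped descriptives (with 95% CI of the mean) and plots for the two vehicle types.
    EXAMINE VARIABLES=sales resale BY type
      /PLOT BOXPLOT STEMLEAF
      /COMPARE GROUPS
      /STATISTICS DESCRIPTIVES
      /CINTERVAL 95
      /MISSING LISTWISE
      /NOTOTAL.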


46. EXPLORE OPTION – INTERPRETING GROUPWISE MEAN AND 95%


CONFIDENCE INTERVAL OF MEAN

Figure 34 : Case Processing Summary

In the output for the grouped descriptives (derived in the previous chapter), we get a different
kind of arrangement. The first table is the Case Processing Summary, which simply tells us that
we have two variables in our analysis - Sales in thousands and 4-year resale value - and gives the
details of each group. For Sales in thousands there are 91 observations in the Automobile group
and 30 observations in the Truck group; similarly, for the resale variable there are 91
Automobiles and 30 Trucks. The valid percentages have also been calculated for each group -
about 78% for Automobiles and 73% for Trucks. The missing case percentage is 21.6% for
Automobiles and 26.8% for Trucks, for both sale and resale. Basically, the number of valid
observations in each group plus the missing cases equals the Total, or 100%.

Next is the main Descriptives table. The output has been organized on the basis of the two main
variables - Sales in thousands and 4-year resale value - and we want the averages of trucks and
automobiles for sale as well as resale. The mean sale price of automobiles is 46.895, with a
standard error of 5.52 (sale and resale values are given in thousands of dollars). The 95%
confidence interval of the mean implies that in 95% of such samples the average sale price of
automobiles is going to lie between 35.92 and 56.86 thousand dollars. That is quite a huge range,
isn't it? A major reason for this is that the standard deviation is quite large.

Figure 35 : Descriptives output

47. 5% TRIMMED MEAN – CONCEPT, USE & INTERPRETATION

The output from the previous chapter also provides the 5% trimmed mean. Trimming is done by
eliminating the outliers or extreme values - the highest and lowest 5% of cases. If the data is
non-normal, it is better to report the 5% trimmed mean rather than the simple mean. Here the 5%
trimmed mean is 39.79. The large difference between the simple mean (46.89 thousand dollars)
and the 5% trimmed mean (39.79 thousand dollars) clearly suggests that this is non-normal data.
So it is better to report the 5% trimmed mean in our write-up, along with the reasons for
preferring it over the simple mean.

Figure 36 : 5% Trimmed mean


48. EXPLORE – MEDIAN, STANDARD DEVIATION, VARIANCE, MINIMUM,


MAXIMUM & RANGE

The median value from the previous car_sales output is 27.85 (Figure 37). In non-normal data
the mean, median and mode are very distinct from each other, whereas in normal data the mean,
median and mode will be almost the same. In this case the median of 27.85 is very distinct from
either the mean or the 5% trimmed mean. Variance is simply the square of the standard
deviation: the standard deviation is 52.67, and if you square that you get the variance. You may
wonder why we have the same information in two different forms - if variance is the square of
the standard deviation, what is the need to report it? Basically, variance gives us a picture of the
variability in a sample, while the standard deviation mainly tells us the spread of scores from the
mean score, so the purposes are slightly different. If you want to talk about the variability in a
data set, mention the variance; if you are denoting to what extent scores are spread away from
the mean, report the standard deviation. If you square 52.67 you get the variance, 2774.89. The
minimum sale value for automobiles is .110 (thousand dollars) and the maximum sale value is
247.994. Range is the difference between the minimum and the maximum, so if you deduct .110
from 247.994 you get the range, i.e. 247.884.

Figure 37 : Median, Variance, Standard deviation, Minimum, Maximum, and Range in
Descriptives output


49. QUARTILES AND INTER-QUARTILE RANGE USING EXPLORE OPTION

Here we will discuss the quartiles and the inter-quartile range from the car_sales output. If you
arrange all the percentiles in ascending order from zero to hundred and divide the entire
arrangement into four parts, you get the four quartiles: the first quartile runs from 0 to 25, the
second from 25 to 50, the third from 50 to 75 and the fourth from 75 to 100. The key points
marking the quartiles are P25, P50, P75 and P100.

Figure 38 : Inter-quartile Range

The inter-quartile range is nothing but the difference between P75 and P25: if you subtract the
score below which 25% of cases lie from the score below which 75% of scores lie, you get the
inter-quartile range. In this case the inter-quartile range for automobiles is 53.23. Inter-quartile
ranges also tell us about the spread of scores - what is the gap between P75 and P25? Let us
compare the inter-quartile range for sales of automobiles versus trucks: it is 53.23 for
Automobiles and 108.641 for Trucks - almost double! So the variation in the sales value of
trucks is much higher than the variation in the sales value of automobiles, and that can be
confirmed by looking at the standard deviation: it is 52.67 for sales of Automobiles and 111.17
for sales of Trucks. Again, the value for automobiles is around half that of the truck category.
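
If you want the quartile cut points themselves for each vehicle type, one way to get them - a sketch that again assumes the variable names sales and type - is to split the file by vehicle type and ask Frequencies for quartiles:

    * Quartiles of sales, computed separately for each vehicle type.
    SORT CASES BY type.
    SPLIT FILE LAYERED BY type.
    FREQUENCIES VARIABLES=sales
      /FORMAT=NOTABLE
      /NTILES=4.
    SPLIT FILE OFF.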


50. SKEWNESS AND KURTOSIS

Skewness and kurtosis are important measures of deviation from normality in a data set. Let us
try to understand skewness and kurtosis in detail. Skewness basically refers to the asymmetry of
the scores around the mean, or around the measures of central tendency. There are two types of
such asymmetry. In positive skewness, the long tail of the distribution lies on the positive side of
the mean: a few individuals score far higher than the average value and pull the mean upwards,
while most individuals score at or below it. In negative skewness, the long tail lies on the low
side of the mean: a few individuals score far lower than the average, while most score at or
above it.

Figure 39 : Skewness and Kurtosis in Output

For instance, consider the income of the residents of a locality. Most households earn a modest
amount, but a handful of very rich residents - say, those living in a plush apartment block - earn
far more than the rest. Those few extremely high incomes stretch the distribution to the right and
pull the mean above the typical income, so the income variable is positively skewed. In contrast,
if most scores are high and only a few are very low - for example, marks in a very easy
examination where most students score near the maximum and a few score very poorly - the
distribution is negatively skewed.

Figure 40 : Positive and Negative Skewness

51. CALCULATING AND INTERPRETING SIGNIFICANCE LEVEL OF SKEWNESS

In the car_sales output, what counts as a significant amount of skewness in the data - a
significant amount of positive or negative skewness? In the case of the sales of trucks, the
skewness value (2.567) is positive, indicating positive skewness in the data; if it were a negative
value, it would have implied negative skewness. It means that the distribution of truck sales has
a long tail on the high side: a few trucks sell for far more than the average value of about 94
thousand dollars and pull the mean upwards, while most sell below it. Whether the skewness is
significant is assessed by checking whether this departure from symmetry is statistically
significant. For this you can divide the skewness value, i.e. 2.567, by its respective standard
error. If the resulting ratio is more than 1.96, there is significant skewness at the α = .05 level; if
it is more than 2.58, there is significant skewness at the α = .01 level; and if the ratio of skewness
to its standard error is more than 3.29, the skewness is significant at the α = .001 level (1.96, 2.58
and 3.29 are the z-values for statistical significance at alpha levels .05, .01 and .001
respectively). Keep in mind, then, that if the ratio of skewness to its standard error is more than
1.96, there is a significant amount of skewness.

Figure 41 : Skewness values for sales of Automobiles and Trucks


What is the relevance of significant skewness or significant kurtosis? Let us understand this with
an example. Suppose a car company is beating the drum about selling a significantly large
number of cars at higher than the average price, and suppose there is competition between
different outlets of that same company. Outlet A claims that it is significantly better because its
salesmen sell cars at prices significantly higher than the average, compared to Outlet B. You
could first establish whether Outlet A's prices really are skewed significantly above the average
value and then make a comparison between the two groups; for that we use the significance of
the skewness and kurtosis values. Let us apply this to the current case and assess whether the
skewness of the truck sales figures is significant. Using your calculator, divide the skewness by
its standard error: 2.567 divided by 0.427 gives 6.011. Since this is higher than all three cutoffs
(1.96, 2.58 and 3.29), there is significant positive skewness, i.e. the sales variable for trucks is
very significantly positively skewed - a few trucks sell at prices far above the average, and it is
the same scenario for automobiles.

Calculate the significance of skewness in the sales value of automobiles by dividing the
skewness by its respective standard error: 2.156 divided by 0.253 gives 8.52, which is highly
significant. Thus the sales figures of both automobiles and trucks are significantly positively
skewed - in both groups a few vehicles sell far above the average sale price. Now let us look at
the resale value: is it also skewed for both automobiles and trucks? You can see that the 4-year
resale value is positively skewed for both automobiles and trucks (Figure 42). This is how you
use the skewness and kurtosis statistics to derive meaningful observations.

Figure 42 : Skewness and standard error for 4-year resale value of Automobiles and Trucks


52. KURTOSIS - CALCULATION, INTERPRETATION AND UNDERSTANDING


SIGNIFICANCE LEVEL
Kurtosis refers to the vertical deviation from normality, whereas skewness can be thought of as a
horizontal deviation. A normal distribution curve is the typical bell-shaped curve. If the
distribution curve is highly peaked and elongated, the variable has a leptokurtic distribution; this
is referred to as positive kurtosis. On the other hand, if the distribution is very flat or
plateau-shaped compared to the normal distribution, we call it a platykurtic distribution (from the
Greek platys, 'flat' - you can remember it by thinking of a plate or saucer); this is referred to as
negative kurtosis. Figure 43 makes this clear: a leptokurtic (positive kurtosis) distribution is
highly peaked and elongated, a mesokurtic distribution is the typical normal distribution, and a
platykurtic distribution is highly flattened.

Figure 43 : Types of Kurtosis

In the car_sales case there is positive kurtosis in the data for all the variables - sales of
automobiles and trucks and the 4-year resale value of automobiles and trucks (Figure 44). So the
data is leptokurtic, in addition to being positively skewed. The data is non-normal and, since the
skewness and kurtosis values are significant, we can conclude that there is a significant deviation
from normality in the data. Thus, instead of reporting the mean, we should report the median.
The medians here are highly distinct from the means, which also indicates that our data is
non-normal; if the data were normal, the mean, median and mode would be the same or almost
the same.

Figure 44 : Kurtosis values in car_sales output


53. STANDARD ERROR OF MEAN – CONCEPT, CALCULATION &


INTERPRETATION

The standard error is usually shown whenever you see a mean in SPSS. The standard error
basically tells you the amount of error you are likely to commit when you estimate the mean
from a sample drawn from a particular population. Since our calculation is based entirely on a
sample, we cannot be sure whether the sample mean is truly representative of the population
mean; the higher the difference or deviation between the sample mean and the population mean,
the higher the error. We therefore need a measure of it, and we call this measure the standard
error. The standard error basically describes to what extent the sample mean deviates from the
population mean.

The standard error is the ratio you get on dividing the sample standard deviation by the square
root of the sample size [SE = SD / SQRoot(N)]. Let us work it out in this case: the reported
standard error is 5.52, and our sample size for sales of Automobiles is 91. Divide the standard
deviation, 52.67, by the square root of 91 (about 9.54) and the resulting value is 5.52. That is
how you calculate and interpret the standard error of the mean. Keep in mind that the standard
error of the mean should be as small as possible, since a small value indicates that your sample
mean is close to your population mean.

Figure 45 : Standard error of mean [SE = SD/SQRoot(N)]

