TOD501-M22 02 Simple Descriptive Statistics Lecture Slides


Lecture course • TOD501: Probability and Statistics • Amrut Mody School of Management, Ahmedabad University, Monsoon 2022

Simple descriptive statistics

Bhargav Adhvaryu
Professor of Urban Science
Amrut Mody School of Management, Ahmedabad University
Bhargav Adhvaryu Simple descriptive statistics 1 of 41

This document is for use ONLY by the students who attend this lecture
and should NOT be circulated and used outside this group of students.

Lecture outline

1. Scales of measurement
2. Frequency tables and histograms
3. Measures of central tendency
4. Measures of dispersion
5. In-class case study


Scales of measurement Frequency tables & histograms Measures of central tendency Measures of dispersion In-class case study

Scales of measurement (1)

Quantitative data
Ratio scale
• Quantifies the variable (giving it magnitude) AND has an absolute zero.
• In other words, a zero on that scale means a ‘true’ zero. Ratios of data values have a
scientific meaning.
• Eg: Length, mass, income, accidents, etc.
Interval scale
• Quantifies the variable (giving it magnitude) BUT has no absolute “zero”, ie the “zero” is
arbitrary!
• The intervals on the scale have a meaning but the ratios do not.
• Eg, temperature (°C or °F). Note that 0°C does not mean no temperature and 40°C is
not twice as hot as 20°C.
• Eg, pH, credit scores, GRE and SAT scores, etc

Scales of measurement (2)

Qualitative data (aka categorical data)


Nominal scale
• Classifies variables into “classes” or “groups”, BUT there is no inherent order in such
classification.
• Eg, colour, gender, profession, caste, nationality, etc.
Ordinal scale
• Classifies variables into “classes” or “groups”, AND there is an explicit order in such
classification. Observations can be marked on scales such as:
• High, medium, low
• Excellent, good, satisfactory, poor, bad


Scales of measurement (3)

• Quantitative data can be further categorised as:

• Discrete variables: 42 trucks, 1500 households, 1000 persons, etc

• Continuous variables: 6.23 km, 57.46 kg, etc

• Lastly, a subjective scale can also be defined that quantifies variables, but its measurement is psychological rather than physical. Data on this scale can be analysed like any other data, as long as it is measured by the same person (‘subject’). When data across different persons is to be used, the analyst must ensure that the data are comparable (have parity).

• Eg: Variables measured in viva examination, talent contests, etc.


Scales of measurement (4)

Source: https://careerfoundry.com/en/blog/data-analytics/what-is-nominal-data/
Source: https://www.graphpad.com/support/faq/what-is-the-difference-between-ordinal-interval-and-ratio-variables-why-should-i-care/


Data tabulation
Raw and ordered tables

Raw data table: toll collected (in thousand Rs./month)

 37   95   70  152   21  114
 97   93  161  150   40  109
 63   38  141  123   51   83
143  148   24   98   98  132
164  184  175  122  100   65
 37   65   35  192  164  148
120  162  110  113   46  122
197   66   54   90  140   66
125  119  117  135   81  140
 31   95  175  100   23  112

• Gives no discernible information
• Difficult to “describe” the data set

Ordered data table: toll collected (in thousand Rs./month), ascending down each column

 21   51   90  110  125  152
 23   54   93  112  132  161
 24   63   95  113  135  162
 31   65   95  114  140  164
 35   65   97  117  140  164
 37   66   98  119  141  175
 37   66   98  120  143  175
 38   70  100  122  148  184
 40   81  100  122  148  192
 46   83  109  123  150  197

• Better than raw data table
• Gives the lowest & highest values
Download the file DATA-ICE-SDS-TOLL.XLSX from the cloud drive for the raw data set.

Data tabulation
Frequency distribution table (grouped data)

Bins               Tally              Frequency   Relative frequency   Cumulative       Cumulative relative
(class interval)                      (f)         (rf) [%]             frequency (cf)   frequency [%]
0–24               |||                 3           5.00%                3                 5%
25–49              ||||| ||            7          11.67%               10                17%
50–74              ||||| |||           8          13.33%               18                30%
75–99              ||||| ||||          9          15.00%               27                45%
100–124            ||||| ||||| |||    13          21.67%               40                67%
125–149            ||||| ||||          9          15.00%               49                82%
150–174            ||||| |             6          10.00%               55                92%
175–199            |||||               5           8.33%               60               100%
Totals                                60         100.00%               -                 -
• Data is presented in a more structured and sensible form.
Q1: Using the file DATA-ICE-SDS-TOLL.XLSX, construct a histogram for the toll data.
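The grouped frequency table can also be built programmatically. The following is a minimal Python sketch, not the course's Excel workflow: the function name `frequency_table` and its `width` and `start` parameters are illustrative assumptions, and the data are the 60 toll values copied from the raw data table in the slides (integer values are assumed).

```python
# Toll collected (thousand Rs./month), copied from the raw data table in the slides.
TOLL = [
    37, 95, 70, 152, 21, 114,
    97, 93, 161, 150, 40, 109,
    63, 38, 141, 123, 51, 83,
    143, 148, 24, 98, 98, 132,
    164, 184, 175, 122, 100, 65,
    37, 65, 35, 192, 164, 148,
    120, 162, 110, 113, 46, 122,
    197, 66, 54, 90, 140, 66,
    125, 119, 117, 135, 81, 140,
    31, 95, 175, 100, 23, 112,
]


def frequency_table(values, width=25, start=0):
    """Return rows of (bin label, f, rf %, cf, cumulative rf %) for integer data."""
    n = len(values)
    n_bins = (max(values) - start) // width + 1
    counts = [0] * n_bins
    for v in values:
        counts[(v - start) // width] += 1  # index of the bin containing v

    rows, cf = [], 0
    for k, f in enumerate(counts):
        cf += f  # running (cumulative) frequency
        lo = start + k * width
        rows.append((f"{lo}-{lo + width - 1}", f, 100 * f / n, cf, 100 * cf / n))
    return rows


for label, f, rf, cf, crf in frequency_table(TOLL):
    print(f"{label:>8} {f:3d} {rf:6.2f}% {cf:3d} {crf:4.0f}%")
```

Calling `frequency_table(TOLL, width=50)` gives a coarser table with four bins, which is one quick way to see how the choice of bin width changes the shape of the resulting histogram.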


Graphical representation of frequency data


Histogram

[Figure: histogram of the toll data, shown alongside the settings for creating it using the built-in Excel charts (version 2016 onwards only)]


Graphical representation of frequency data


Cumulative frequency curve (ogive)

[Figure: cumulative frequency curve of the toll data, with the 82% level marked at a value of 150]

• This means that 82% of the values in this data set are below 150. In other words, 150 is the 82nd percentile value.
• This graph can be used to calculate other percentile values (eg, median, quartiles, etc).


Notes on histogram

• The shape of the histogram changes as the width of the bins (class intervals) changes.
• There are no fixed norms or rules to decide the bin width; the best way is to ‘try out’ a few bin widths and look for a shape that best describes the data set!
• The following website gives an online tool to make histograms:
http://www.shodor.org/interactivate/activities/Histogram/#
Choose various bin widths and check the effect for yourselves!
• Joining the mid-points of the tops of the bars in the histogram gives a frequency polygon, and as the number of bins increases, the frequency polygon gets smoother.


Introduction

• Huge data sets need to be summarised meaningfully.
• Frequency tables and histograms indicate how the data is ‘spread’.
• Measures of central tendency (MCT) allow us to see where a particular variable stands in
relation to others (often called the average).
• MCT gives brief yet substantial information about the dataset, without overburdening us
with large stacks of numbers.
• The brevity of MCT helps communicate key information about the dataset easily and
quickly.
• Three MCT can be calculated for quantitative data:
• Mean
• Median
• Mode (can be calculated for qualitative data as well)


Mean

• Technically known as the arithmetic mean and popularly known as the average, it is calculated
as:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$


Median and mode

• Median is the middle number in the array of numbers in the dataset arranged in ascending
or descending order.
Eg: 3, 5, 2, 1, 6, 10, 9 (note that dataset has 7 values (odd number))
• To find the median, arrange the numbers in ascending order as:
1, 2, 3, 5, 6, 9, 10 (here the middle value of 5 is the median).
• In case of a dataset with an even number of values, the mean of the middle two values is the median:
Eg: 3, 5, 2, 1, 6, 9, then the median is:
1, 2, 3, 5, 6, 9 and (3+5)/2 = 4
• Mode is the most frequently occurring number in a dataset.
Eg: 1, 6, 3, 4, 3, 7, 9, 3, 11, 8, 3, 8
• The mode is: 3
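The two procedures above can be sketched in a few lines of Python. This is an illustrative from-scratch version (the function names `median` and `modes` are assumptions for this sketch; Python's `statistics` module offers equivalents), applied to the example datasets on this slide.

```python
from collections import Counter


def median(values):
    """Middle value of the sorted data; mean of the two middle values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def modes(values):
    """All values that occur most often (a dataset may have more than one mode)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)


print(median([3, 5, 2, 1, 6, 10, 9]))                # odd number of values
print(median([3, 5, 2, 1, 6, 9]))                    # even number of values
print(modes([1, 6, 3, 4, 3, 7, 9, 3, 11, 8, 3, 8]))  # most frequent value(s)
```

Returning a list from `modes` is a deliberate choice in this sketch, so that bi- or multi-modal datasets are reported in full rather than silently reduced to one value.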


Mean, median, mode for a data set

Download the file DATA-ICE-SDS-MCT.XLSX from the cloud drive for the raw data set.

Data for managers in a firm

Manager#   Absent days/year
1           3
2          10
3           4
4          20
5          30
6           4
7          15
8           6
9           4
10          4

• Mean is 10
Suggests that a few managers had unusually long or short absenteeism.
• Median is 5
Means half of the managers were absent 5 days or less and the other half, 5 days or more.
• Mode is 4
Means that more managers were absent 4 days than any other time period.



Measures of central tendency


Some characteristics

Mean | Median | Mode
Familiar to most people | Not commonly used | Not commonly used
Very simple to understand and calculate | Simple to understand and calculate | Simple to understand and calculate
Cannot be used for qualitative data | Cannot be used for qualitative data | The only MCT that can be used for nominal or ordinal data
Least affected due to sampling fluctuations | Affected by sampling fluctuations | Affected by sampling fluctuations
Affected by extreme values | Not affected by extreme values | Not affected by extreme values
Further mathematical processing possible | No further mathematical processing possible | No further mathematical processing possible
Not advisable to be used for incomplete datasets | May be used for incomplete datasets | May be used for incomplete datasets
- | - | Difficult to interpret in case of bi- or multi-modal datasets


“Averaging” means
A quick note

• Say there are two firms, A and B, with the following characteristics:
• A: 5,000 managers and mean salary of Rs. 40,000/month
• B: 3,000 managers and mean salary of Rs. 50,000/month
Find the combined mean salary of the managers.
• Combined mean = (40,000 + 50,000)/2 = Rs. 45,000/month. This is WRONG!
• The correct method is as follows:

Recall that:
$$\bar{x} = \frac{\sum x}{n} \quad\Rightarrow\quad \sum x = n\bar{x}$$

The combined mean:
$$\bar{x}_{AB} = \frac{\sum x_A + \sum x_B}{n_A + n_B} = \frac{n_A\bar{x}_A + n_B\bar{x}_B}{n_A + n_B} = \frac{(5{,}000 \times 40{,}000) + (3{,}000 \times 50{,}000)}{5{,}000 + 3{,}000} = \frac{350{,}000{,}000}{8{,}000} = 43{,}750$$
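The combined mean is a weighted mean, with each firm's mean weighted by its number of managers. A minimal Python sketch of the calculation follows; the function name `combined_mean` and the `(n, mean)` pair layout are assumptions made for illustration.

```python
def combined_mean(groups):
    """Combined mean of several groups given as (group size n, group mean) pairs."""
    groups = list(groups)                          # allow any iterable
    total = sum(n * mean for n, mean in groups)    # sum of all salaries
    count = sum(n for n, _ in groups)              # total number of managers
    return total / count


firms = [(5000, 40000), (3000, 50000)]  # firm A, firm B from the slide
naive = (40000 + 50000) / 2             # ignores group sizes

print(naive)                 # the WRONG answer
print(combined_mean(firms))  # the size-weighted answer
```

The naive average only coincides with the combined mean when both groups have the same size.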

Some other types of means


Geometric mean
• Used for averaging rates of change and constructing index numbers.
• Defined as the $n$-th root of the product of $n$ terms:

$$\text{Geometric mean} = \sqrt[n]{\prod_{i=1}^{n} x_i} = \left(\prod_{i=1}^{n} x_i\right)^{1/n}$$

• When a dataset has many items, manual calculations get cumbersome, so logarithms may be used to simplify:

$$\ln \mathrm{GM} = \frac{\sum_{i=1}^{n} \ln x_i}{n} \quad\therefore\quad \mathrm{GM} = \exp\left(\frac{\sum_{i=1}^{n} \ln x_i}{n}\right)$$

A logarithm is the power to which a fixed number (the base) is raised to obtain a given number: $\log_{10} 1000 = 3 \Rightarrow 10^3 = 1000$

Some other types of means


Geometric mean: an example

Download the file DATA-ICE-SDS-MCT.XLSX from the cloud drive for the raw data set.

Year   Turnover (Rs. Cr)   Rate of change
2016   100                 -
2017   140                 1.400
2018   160                 1.143
2019   210                 1.313
2020   400                 1.905

Q: What is the average rate of change over the four years?

Answer:
$$\frac{1.400 + 1.143 + 1.313 + 1.905}{4} \cong 1.44$$
This is WRONG!

Method-1: Compound interest formula
$$P_n = P_0 (1 + r)^n \quad\Rightarrow\quad 400 = 100(1 + r)^4$$
Solving for $r$ we get
$$r = \left(\frac{400}{100}\right)^{1/4} - 1 = 1.414 - 1 = 0.414 \ (41.4\%)$$

Method-2: Geometric mean formula
$$\left(\prod_{i=1}^{n} x_i\right)^{1/n} = (1.4 \times 1.143 \times 1.313 \times 1.905)^{1/4} = 1.414 \ (\text{ie } 41.4\% \text{ growth per year})$$

Relationship between rate of change and compound interest
This shows that the rate of interest in the compound interest formula is indeed the same as the average rate of change given by the geometric mean.
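Both methods can be checked numerically. The sketch below, in Python, recomputes the year-on-year ratios from the turnover figures in the table, takes their geometric mean via logarithms, and compares it with the compound-interest rate; the variable names are illustrative assumptions.

```python
import math

turnover = [100, 140, 160, 210, 400]  # Rs. Cr, 2016 to 2020 (from the slide)

# Year-on-year rates of change: 140/100, 160/140, 210/160, 400/210
ratios = [later / earlier for earlier, later in zip(turnover, turnover[1:])]
n = len(ratios)

arithmetic = sum(ratios) / n                            # the WRONG average
geometric = math.exp(sum(math.log(r) for r in ratios) / n)  # GM via logarithms
compound_rate = (turnover[-1] / turnover[0]) ** (1 / n) - 1  # r from P_n = P_0 (1 + r)^n

print(round(arithmetic, 3), round(geometric, 3), round(compound_rate, 3))

# Applying the geometric mean four times reproduces the final turnover;
# applying the arithmetic mean overshoots it.
print(round(turnover[0] * geometric ** n, 6))
print(round(turnover[0] * arithmetic ** n, 1))
```

The product of the ratios telescopes to 400/100, which is why the geometric mean and the compound-interest rate agree exactly.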

Some other types of means


Harmonic mean
• Defined as the reciprocal of the arithmetic mean of the reciprocals of the individual terms:

$$\text{Harmonic mean} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$

• Has very limited use (explained in the next slides):
1. Generally used for averaging speeds.
2. A more appropriate representation of the mean when there are a few outliers in the data
set.


Some other types of means


Harmonic mean: an example

Download the file DATA-ICE-SDS-MCT.XLSX from the cloud drive for the raw data set.

From–to   Speed (km/h)   Distance (km)   Time (h)
A–B       30             20              2/3
B–C       40             20              1/2
C–D       50             20              2/5
D–E       60             20              1/3

Q: What is the average speed of the entire journey for a car travelling from A to E?
(30+40+50+60)/4 = 45 km/h
This is WRONG!

The correct answer is:
$$\text{Average speed} = \frac{80 \text{ km}}{\left(\frac{2}{3} + \frac{1}{2} + \frac{2}{5} + \frac{1}{3}\right) \text{h}} = \frac{80}{1.9} \approx 42.105 \text{ km/h}$$

Using the harmonic mean formula:
$$\text{Average speed} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{4}{\frac{1}{30} + \frac{1}{40} + \frac{1}{50} + \frac{1}{60}} = \frac{4}{0.095} \approx 42.105 \text{ km/h}$$
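The two routes to the answer can be compared exactly with Python's `fractions` module. This is a small sketch using the four legs from the table; the `legs` layout and variable names are assumptions made for illustration.

```python
from fractions import Fraction

# (speed in km/h, distance in km) for legs A-B, B-C, C-D, D-E
legs = [(30, 20), (40, 20), (50, 20), (60, 20)]

# Route 1: total distance divided by total time
total_distance = sum(d for _, d in legs)
total_time = sum(Fraction(d, s) for s, d in legs)  # hours, as an exact fraction
average_speed = total_distance / total_time

# Route 2: harmonic mean of the speeds
harmonic_mean = len(legs) / sum(Fraction(1, s) for s, _ in legs)

print(total_time)            # exact total time in hours
print(float(average_speed))  # km/h
print(float(harmonic_mean))  # km/h
```

The two results agree here because every leg covers the same distance; with unequal distances, the total-distance-over-total-time calculation is the one to trust.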

Some other types of means


Harmonic mean: another example
• At times, the harmonic mean is a more appropriate representation of the mean when
there are outliers (extreme values).

Download the file DATA-ICE-SDS-MCT.XLSX from the cloud drive for the raw data set.

[Figure: bar chart of the salaries of 20 marketing managers in Rs.’000 per month. Data labels (order not preserved): 155, 48, 50, 45, 45, 46, 50, 43, 40, 38, 43, 40, 42, 35, 37, 34, 35, 35, 40, 34]

Q: Find the arithmetic mean, the harmonic mean, and the arithmetic mean without the outlier for the salary data of the 20 marketing managers.
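One way to attempt this question in Python is sketched below. The salary values are read from the data labels of the bar chart on this slide (their order along the axis is not preserved, which does not affect any of the means); treating the maximum value as the single outlier is an assumption of this sketch.

```python
import statistics

# Salaries in Rs.'000 per month, read from the chart's data labels.
salaries = [155, 48, 50, 45, 45, 46, 50, 43, 40, 38,
            43, 40, 42, 35, 37, 34, 35, 35, 40, 34]

outlier = max(salaries)  # assumed to be the one extreme value (155)
without_outlier = [s for s in salaries if s != outlier]

arithmetic = statistics.mean(salaries)
harmonic = statistics.harmonic_mean(salaries)
arithmetic_no_outlier = statistics.mean(without_outlier)

print(f"Arithmetic mean:              {arithmetic:.2f}")
print(f"Harmonic mean:                {harmonic:.2f}")
print(f"Arithmetic mean, no outlier:  {arithmetic_no_outlier:.2f}")
```

On this data the harmonic mean lands much closer to the outlier-free arithmetic mean than the plain arithmetic mean does, which is the point the slide makes.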


In-class example (Q2)

Q2: Using the file DATA-ICE-SDS-TOLL.XLSX from the cloud drive, calculate the mean, median,
and mode (for ungrouped data).
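Solution to Q2
• Mean: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{6271}{60} \approx 104.5$
• Median: mean of the 30th and 31st values in the ordered data $= \frac{109 + 110}{2} = 109.5$
• Mode: 37*

* This is the smallest value of the mode. The data could have more than one mode.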
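For reference, Q2 can also be worked outside Excel. The following Python sketch uses the standard `statistics` module on the 60 toll values copied from the raw data table earlier in the slides; `statistics.multimode` is used so that every mode is reported, not only one.

```python
import statistics

# Toll collected (thousand Rs./month), copied from the raw data table in the slides.
TOLL = [
    37, 95, 70, 152, 21, 114,
    97, 93, 161, 150, 40, 109,
    63, 38, 141, 123, 51, 83,
    143, 148, 24, 98, 98, 132,
    164, 184, 175, 122, 100, 65,
    37, 65, 35, 192, 164, 148,
    120, 162, 110, 113, 46, 122,
    197, 66, 54, 90, 140, 66,
    125, 119, 117, 135, 81, 140,
    31, 95, 175, 100, 23, 112,
]

mean = statistics.mean(TOLL)
median = statistics.median(TOLL)    # mean of the 30th and 31st ordered values
modes = statistics.multimode(TOLL)  # every value tied for the highest frequency

print(f"Mean   = {mean:.1f}")
print(f"Median = {median}")
print(f"Modes  = {sorted(modes)} (smallest: {min(modes)})")
```

Several values occur twice in this dataset and none occurs more often, so reporting a single mode is a matter of convention.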

Types of measures of dispersion

• Range
• Average deviation
• Variance
• Standard deviation
• At times, two data sets may have the same or similar measures of central tendency (eg,
mean or median), but still have “different characteristics”.
• Measures of dispersion try to explain this “variation” which “escapes” measures of
central tendency.
• Whilst a histogram gives us a visual measure of the “spread” in the dataset, measures of
dispersion enable us to quantify it.


Range

Download the file DATA-ICE-SDS-MOD.XLSX from the cloud drive for data used in this section.

Neighbourhood A
Household   Income in thousand Rs./month
A           10
B           20
C           30
D           40
E           50

Neighbourhood B
Household   Income in thousand Rs./month
A           28
B           29
C           30
D           31
E           32

Calculate the mean and median for Neighbourhood A and for Neighbourhood B.

• The mean is the same for both datasets, ie 30. However, intuitively we can see that these datasets have different characteristics.
• This difference can be measured using the range, which is the difference between the highest and lowest values:
• Range for neighbourhood A: $50 - 10 = 40$
• Range for neighbourhood B: $32 - 28 = 4$

Conclusion: A higher range means the distribution is more spread/dispersed; some households are
relatively very poor, and some, very rich.


Average deviation

HH income in thousand Rs./month

Neighbourhood 1   Neighbourhood 2
10                20
30                20
30                20
30                20
30                20
30                20
30                60
50                60

Calculate the mean, median and range for Neighbourhood 1:
• Mean = 30
• Median = 30
• Range = 40

Calculate the mean, median and range for Neighbourhood 2:
• Mean = 30
• Median = 20
• Range = 40

In such cases, the difference between the two datasets can be captured by the average
deviation, which examines each value in the dataset w.r.t. the mean.

Average deviation
$$\mathrm{AD} = \frac{\sum_{i=1}^{n} \lvert x_i - \bar{x} \rvert}{n}$$
Note: $\lvert x_i - \bar{x} \rvert$ is called the modulus, which ignores the sign.

• AD for neighbourhood 1: $40/8 = 5$
• AD for neighbourhood 2: $120/8 = 15$

Conclusion: A higher AD implies that the data points in the dataset deviate more from the mean.
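The average deviation is straightforward to compute directly from its definition. A minimal Python sketch, using the two neighbourhood datasets from this slide (the function name `average_deviation` is an illustrative assumption):

```python
def average_deviation(values):
    """Mean of the absolute deviations of each value from the arithmetic mean."""
    n = len(values)
    mean = sum(values) / n
    return sum(abs(x - mean) for x in values) / n


neighbourhood_1 = [10, 30, 30, 30, 30, 30, 30, 50]
neighbourhood_2 = [20, 20, 20, 20, 20, 20, 60, 60]

for name, data in [("1", neighbourhood_1), ("2", neighbourhood_2)]:
    value_range = max(data) - min(data)
    print(f"Neighbourhood {name}: range = {value_range}, AD = {average_deviation(data)}")
```

Both datasets share the same mean and range, so the average deviation is the first of the measures here that separates them.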


Variance
Background

• As noted, the sign is ignored in calculating the average deviation.
• Because of this, whether a value is above or below the mean is not captured.
• However, if we keep the sign of all the deviations from the mean, then the sum of such deviations will be equal to zero!
• And therefore, the mean of such deviations will also be ZERO!
• To avoid this, we square the deviations from the mean, sum them, and then average them, giving us the statistic called variance.


Variance
Introduction

Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$

Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$

Notes:
[1] The symbols $\sigma^2$, $s^2$ are for population and sample variance, respectively.
[2] The symbols $\mu$, $\bar{x}$ are for population and sample means, respectively.
[3] The symbols $N$, $n$ are the sizes of the population and sample, respectively.

Note: When sampling, usually $\mu$ is not known and only $\bar{x}$ is known, in which case $(n - 1)$ is
used instead of $N$. This is because there are only $(n - 1)$ independent deviations $(x_i - \bar{x})$.
In other words, the value of any one deviation is ALWAYS equal to the negative of the sum
of the other $(n - 1)$ deviations. This means that there are only $(n - 1)$ values available for calculating
the mean of the squared deviations; one observation is NOT free to vary. This concept is
known as DEGREES OF FREEDOM and is extensively used in statistics.
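The degrees-of-freedom argument can be seen numerically. The Python sketch below uses the five incomes of Neighbourhood A from the earlier range example as an illustrative sample: the deviations from the sample mean sum to zero, so the last one is fixed by the other four, and the two variance formulas differ only in the divisor.

```python
import statistics

sample = [10, 20, 30, 40, 50]  # Neighbourhood A incomes, used here as an example
n = len(sample)
mean = statistics.mean(sample)

deviations = [x - mean for x in sample]
print(sum(deviations))  # deviations about the mean always sum to zero

# The last deviation is determined by the other (n - 1) deviations.
implied_last = -sum(deviations[:-1])
print(implied_last, deviations[-1])

sum_of_squares = sum(d ** 2 for d in deviations)
population_variance = sum_of_squares / n    # divisor N
sample_variance = sum_of_squares / (n - 1)  # divisor (n - 1)
print(population_variance, sample_variance)
```

The same two results are available as `statistics.pvariance(sample)` and `statistics.variance(sample)`.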

Standard deviation
Introduction

• It should be noted that the units of variance are the square of the units of the data points, which is a bit difficult to comprehend.
• To avoid this, the square root of the variance is taken, giving us one of the most popular statistics, called the standard deviation:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}} \qquad s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$


Standard deviation
Alternative short-cut formulas
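• As the value of $n$ increases, calculations using the definition of the standard deviation tend to
become cumbersome, especially using hand-held calculators, for which the following
alternative formulas may be used:

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} x_i^2}{N} - \left(\frac{\sum_{i=1}^{N} x_i}{N}\right)^2} \qquad s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}}{n - 1}}$$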
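A short-cut formula of this kind needs only the running totals of $x$ and $x^2$. The Python sketch below compares one standard short-cut form, $\sum x_i^2 - (\sum x_i)^2/n$ for the sum of squared deviations, against the definition; the function names are illustrative assumptions, and the data are the two neighbourhood income datasets from the range example.

```python
import math


def sd_definition(values, sample=True):
    """SD from the squared deviations about the mean."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)
    return math.sqrt(ss / (n - 1 if sample else n))


def sd_shortcut(values, sample=True):
    """SD from the totals sum(x) and sum(x^2) only (no deviations needed)."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    ss = sum_x2 - sum_x ** 2 / n  # equals the sum of squared deviations
    return math.sqrt(ss / (n - 1 if sample else n))


income_a = [10, 20, 30, 40, 50]
income_b = [28, 29, 30, 31, 32]

for name, data in [("A", income_a), ("B", income_b)]:
    print(name,
          round(sd_shortcut(data, sample=False), 3),  # population SD
          round(sd_shortcut(data, sample=True), 3))   # sample SD
```

The short-cut form is convenient by hand, but in floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the definition-based form is usually the safer choice in software.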


Standard deviation
Calculations
Download the file DATA-ICE-SDS-MOD.XLSX from the cloud drive for data used in this section.

Neighbourhood A
Household   Income in thousand Rs./month
A           10
B           20
C           30
D           40
E           50

Calculate the SD for Neighbourhood A:
• Population SD for neighbourhood A = 14.14
• Sample SD for neighbourhood A = 15.81

Neighbourhood B
Household   Income in thousand Rs./month
A           28
B           29
C           30
D           31
E           32

Calculate the SD for Neighbourhood B:
• Population SD for neighbourhood B = 1.414
• Sample SD for neighbourhood B = 1.581

Conclusion: Neighbourhood B has a less dispersed (tighter) distribution of income
compared to Neighbourhood A.

Coefficient of variation

• When two datasets with different units need to be compared for dispersion, the
coefficient of variation (CV) is used.
• CV is the standard deviation expressed as a percentage of the mean:

$$\mathrm{CV} = \frac{\text{standard deviation}}{\text{mean}} \times 100\%$$

• Dataset 1: mean = 450 km/week, SD = 42 km/week


• Dataset 2: mean = 30,000 Rs./month, SD = 1900 Rs./month
Which data set has more variability or dispersion?
• CV of dataset 1 = (42/450)*100 ≈ 9.3%
• CV of dataset 2 = (1900/30,000)*100 ≈ 6.3%
Conclusion: Dataset 1 has more dispersed data points or has more variability.
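The comparison above reduces to a one-line calculation per dataset. A minimal Python sketch, with an illustrative function name, using the summary figures from this slide:

```python
def coefficient_of_variation(mean, sd):
    """Standard deviation expressed as a percentage of the mean (unit-free)."""
    return sd / mean * 100


cv_1 = coefficient_of_variation(mean=450, sd=42)      # km/week
cv_2 = coefficient_of_variation(mean=30000, sd=1900)  # Rs./month

print(f"CV of dataset 1 = {cv_1:.1f}%")
print(f"CV of dataset 2 = {cv_2:.1f}%")
print("More variable:", "dataset 1" if cv_1 > cv_2 else "dataset 2")
```

Because the units cancel in the ratio, the two percentages can be compared directly even though the raw standard deviations (42 km/week and 1900 Rs./month) cannot.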


Graphical representation of data spread


Box (and whisker) plots

[Figure: box plot of the toll data, with the box and whiskers labelled as follows]
• Maximum = 197
• 3rd quartile (75th percentile, Q3) = 140.25
• Median (Q2, 50th percentile) = 109.5
• Mean = 104.5
• 1st quartile (Q1, 25th percentile) = 65.75
• Minimum = 21
• Interquartile range (IQR) = Q3 − Q1 = 140.25 − 65.75 = 74.5

Note:
• If there are outliers, they are shown (in most software).
• Lower outliers are values less than (Q1 − 1.5 IQR).
• Higher outliers are values greater than (Q3 + 1.5 IQR).
• In that case, the whiskers end at (Q1 − 1.5 IQR) and (Q3 + 1.5 IQR), respectively.
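The box-plot statistics can be reproduced in Python. The sketch below uses `statistics.quantiles` with `method="inclusive"`, which is assumed here to correspond to the interpolation behind the quartile values on the slide (it reproduces them for this data); the toll values are copied from the raw data table earlier in the slides.

```python
import statistics

# Toll collected (thousand Rs./month), copied from the raw data table in the slides.
TOLL = [
    37, 95, 70, 152, 21, 114,
    97, 93, 161, 150, 40, 109,
    63, 38, 141, 123, 51, 83,
    143, 148, 24, 98, 98, 132,
    164, 184, 175, 122, 100, 65,
    37, 65, 35, 192, 164, 148,
    120, 162, 110, 113, 46, 122,
    197, 66, 54, 90, 140, 66,
    125, 119, 117, 135, 81, 140,
    31, 95, 175, 100, 23, 112,
]

q1, q2, q3 = statistics.quantiles(TOLL, n=4, method="inclusive")
iqr = q3 - q1

# Fences used to flag outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in TOLL if x < lower_fence or x > upper_fence]

print(f"min = {min(TOLL)}, Q1 = {q1}, median = {q2}, Q3 = {q3}, max = {max(TOLL)}")
print(f"IQR = {iqr}, fences = ({lower_fence}, {upper_fence}), outliers = {outliers}")
```

No value falls outside the fences for this dataset, so the whiskers simply run to the minimum and maximum.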

Graphical representation of data spread


Skewness
• Skewness is a measure of the asymmetry of the distribution about the mean.
• There are three types of skewness:

Negatively (left) skewed: skewness < 0
Symmetric: skewness = 0
Positively (right) skewed: skewness > 0
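The slides do not give a formula for skewness, so the Python sketch below assumes one common definition, the moment coefficient $m_3 / m_2^{3/2}$ (with $m_k$ the $k$-th central moment using divisor $n$); other software may apply a sample-size correction and report slightly different values. The datasets are the two neighbourhoods from the average deviation example, plus a mirrored copy made up for illustration.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (divisor n, no correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / m2 ** 1.5


symmetric = [10, 30, 30, 30, 30, 30, 30, 50]     # Neighbourhood 1
right_skewed = [20, 20, 20, 20, 20, 20, 60, 60]  # Neighbourhood 2: long right tail
left_skewed = [80 - x for x in right_skewed]     # mirror image (hypothetical data)

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    print(f"{name:>12}: skewness = {skewness(data):+.3f}")
```

The sign follows the direction of the longer tail, which is also why the mean (30) sits above the median (20) in the right-skewed dataset.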


Graphical representation of data spread


Visualising skewness using box plots


Graphical representation of data spread


Kurtosis
• Kurtosis is a measure of the “peakedness” or “flatness” of a distribution. It combines the relative weights of the tails and the centre.
• There are three types of kurtosis (the values are excess kurtosis, ie measured relative to the Normal distribution):

Platykurtic (peak is flatter than Normal): kurtosis < 0
Mesokurtic (Normal distribution): kurtosis = 0
Leptokurtic (peak is taller than Normal): kurtosis > 0
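A minimal sketch of the calculation, assuming the sample excess kurtosis formula documented for Excel's KURT function: [n(n+1)/((n−1)(n−2)(n−3))] × Σ((x − mean)/s)⁴ − 3(n−1)²/((n−2)(n−3)). The data sets and the function name sample_excess_kurtosis are hypothetical.

```python
import statistics


def sample_excess_kurtosis(values):
    """Sample excess kurtosis, so that a Normal distribution scores about 0."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least four values")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction


# Hypothetical data: evenly spread values versus a tight centre with two extremes.
flat = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
peaked = [5, 5, 5, 5, 5, 5, 5, 5, -5, 15]

print(f"flat:   kurtosis = {sample_excess_kurtosis(flat):.2f}")    # platykurtic
print(f"peaked: kurtosis = {sample_excess_kurtosis(peaked):.2f}")  # leptokurtic
```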

Kurtosis


In-class example (Q3)


Answers: calculating MOD

Q3: Using the file DATA-ICE-SDS-TOLL.XLSX from the cloud drive, calculate the following:

1. Range, average deviation, standard deviation and variance
2. Minimum, maximum, 1st quartile, 3rd quartile and interquartile range (IQR)
3. Draw a box plot (in Excel if available, or use the RAWGraphs website)
4. Calculate the skewness and kurtosis (using the Excel Data Analysis functionality or the SKEW and KURT functions)
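For readers working outside Excel, part 1 of the question can be sketched in Python. The data are hypothetical (the toll file is not reproduced here), dispersion_measures is an illustrative name, and "average deviation" is taken to mean the mean absolute deviation about the mean.

```python
import statistics


def dispersion_measures(values):
    """Range, average deviation, and population and sample SD and variance."""
    mean = statistics.fmean(values)
    return {
        "range": max(values) - min(values),
        # Mean of the absolute distances from the mean.
        "average_deviation": sum(abs(x - mean) for x in values) / len(values),
        "population_sd": statistics.pstdev(values),           # divides by n
        "sample_sd": statistics.stdev(values),                # divides by n - 1
        "population_variance": statistics.pvariance(values),
        "sample_variance": statistics.variance(values),
    }


# Hypothetical data, not the lecture's toll data set.
data = [2, 4, 4, 4, 5, 5, 7, 9]
measures = dispersion_measures(data)
for name, value in measures.items():
    print(f"{name}: {value:.4f}")
```

The sample variance always equals the population variance multiplied by n/(n − 1), which is a quick consistency check on any pair of reported values.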


In-class example (Q3)


Answers to Q3
• Range: 176.00
• Average deviation: 38.9661
• Standard deviation: σ = 46.7278 & s = 47.1221
• Variance: σ² = 2183.4831 & s² = 2220.4912
• 1st quartile (Q1): 65.75
• 2nd quartile (Q2, aka median): 109.50 (recalled from before)
• 3rd quartile: 140.25
• IQR: 74.50
• Box plot is shown
• Skewness: −0.05 (negatively or left skewed)
• Kurtosis: −0.88 (platykurtic, peak is flatter than normal)


In-class case study

1. Download the case study file BACS01-DS01-MBA SALARIES.PDF and the associated data file BACS01-DS01-MBA SALARIES.XLSX from the cloud drive.
2. Read it and then answer the questions.


Excel functions learnt for SDS

=FREQUENCY(CELLREF1:CELLREF2, BINARRAY1:BINARRAY2)
=AVERAGE(CELLREF1:CELLREF2)
=MEDIAN(CELLREF1:CELLREF2)
=MODE.S(CELLREF1:CELLREF2)
=MODE.MULT(CELLREF1:CELLREF2)
=STDEV.P(CELLREF1:CELLREF2)
=STDEV.S(CELLREF1:CELLREF2)
=VAR.P(CELLREF1:CELLREF2)
=VAR.S(CELLREF1:CELLREF2)
=MIN(CELLREF1:CELLREF2)
=MAX(CELLREF1:CELLREF2)
=QUARTILE.INC(CELLREF1:CELLREF2,0)
=QUARTILE.INC(CELLREF1:CELLREF2,1)
=QUARTILE.INC(CELLREF1:CELLREF2,2)
=QUARTILE.INC(CELLREF1:CELLREF2,3)
=QUARTILE.INC(CELLREF1:CELLREF2,4)
=SKEW(CELLREF1:CELLREF2)
=KURT(CELLREF1:CELLREF2)
=ABS(CELLREF1:CELLREF2)
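Of the functions listed here, FREQUENCY has the least obvious counting rule: each bin counts the values greater than the previous bin limit and less than or equal to its own limit, and one extra count at the end holds everything above the last limit. A rough Python equivalent is sketched below. The function name frequency and the data are hypothetical, and the sketch assumes the bin limits are sorted in ascending order.

```python
from bisect import bisect_left


def frequency(data, bins):
    """Count values per bin using upper-inclusive bin limits.

    counts[i] is the number of values v with bins[i-1] < v <= bins[i].
    The final entry counts the values above the last bin limit.
    Assumes `bins` is sorted in ascending order.
    """
    counts = [0] * (len(bins) + 1)
    for v in data:
        # bisect_left returns the index of the first bin limit that is >= v.
        counts[bisect_left(bins, v)] += 1
    return counts


# Hypothetical data and bin limits.
values = [21, 45, 60, 72, 88, 95, 110, 120, 135, 150, 197]
limits = [50, 100, 150]
counts = frequency(values, limits)
print(counts)
```

The returned list has one more entry than the list of bin limits, which mirrors why the FREQUENCY output range in Excel needs one extra cell.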
