TOD501-M22 02 Simple Descriptive Statistics Lecture Slides
This document is for use ONLY by the students who attend this lecture
and should NOT be circulated and used outside this group of students.
Lecture course • TOD501: Probability and Statistics • Amrut Mody School of Management, Ahmedabad University, Monsoon 2022
Lecture outline
1. Scales of measurement
2. Frequency tables and histograms
3. Measures of central tendency
4. Measures of dispersion
5. In-class case study
Scales of measurement Frequency tables & histograms Measures of central tendency Measures of dispersion In-class case study
Quantitative data
Ratio scale
• Quantifies the variable (giving it magnitude) AND has an absolute zero.
• In other words, a zero on that scale means a ‘true’ zero. Ratios of data values have a
scientific meaning.
• Eg: Length, mass, income, accidents, etc.
Interval scale
• Quantifies the variable (giving it magnitude) BUT has no absolute “zero”, ie the “zero” is
arbitrary!
• The intervals on the scale have a meaning but the ratios do not.
• Eg, temperature (°C or °F). Note that 0°C does not mean no temperature and 40°C is
not twice as hot as 20°C.
• Eg, pH, credit scores, GRE and SAT scores, etc
Bhargav Adhvaryu Simple descriptive statistics 3 of 41
• Lastly, a subjective scale can also be defined that quantifies variables, but its
measurement is psychological rather than physical. Data on this scale can be analysed
like any other data, as long as they are measured by the same person (‘subject’). When data
across different persons are to be used, the analyst must ensure that data are
Data tabulation
Raw and ordered tables
Raw data table: Toll collected (in thousand Rs./month)
• Difficult to “describe” the data set
Ordered data table: Toll collected (in thousand Rs./month)
• Gives the lowest & highest values
Download the file DATA-ICE-SDS-TOLL.XLSX from the cloud drive for the raw data set.
Data tabulation
Frequency distribution table (grouped data)
Table columns: Bins (class interval) | Tally | Frequency (f) | Relative frequency (rf) [%] | Cumulative frequency (cf) | Cumulative relative frequency [%]
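The columns of a grouped frequency table can be sketched in Python as below. This is only an illustration: the toll values and bin edges are hypothetical (the real data are in the course spreadsheet), and the helper name `frequency_table` is an assumption, not part of the course material.

```python
from itertools import accumulate

def frequency_table(values, bin_edges):
    """Return rows of (bin label, f, rf %, cf, cumulative rf %).

    Bins are [lower, upper); the last bin also includes its upper edge.
    """
    n = len(values)
    bins = list(zip(bin_edges[:-1], bin_edges[1:]))
    counts = []
    for k, (lo, hi) in enumerate(bins):
        is_last = k == len(bins) - 1
        f = sum(1 for v in values if lo <= v < hi or (is_last and v == hi))
        counts.append(f)
    cumulative = list(accumulate(counts))
    return [
        (f"{lo}-{hi}", f, 100 * f / n, cf, 100 * cf / n)
        for (lo, hi), f, cf in zip(bins, counts, cumulative)
    ]

# Hypothetical toll collections (thousand Rs./month), NOT the course data.
toll = [12, 15, 21, 22, 29, 30, 34, 38, 41, 50]
rows = frequency_table(toll, [10, 20, 30, 40, 50])
for label, f, rf, cf, crf in rows:
    print(f"{label:>7}  f={f}  rf={rf:.0f}%  cf={cf}  cum. rf={crf:.0f}%")
```

The last row's cumulative frequency should equal the number of observations, which is a quick check that no value fell outside the bins.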
Notes on histogram
• The shape of the histogram changes as the width of the bins (class intervals) changes.
• There are no fixed norms or rules to decide bin width - the best way is to ‘try out’ a few bin
widths and look for a shape that best describes the data set!
http://www.shodor.org/interactivate/activities/Histogram/#
Choose various bin widths and check out the effect for yourselves!
• Joining the mid-points of the tops of the bars in the histogram gives a frequency polygon and, as
the number of bins increases, the frequency polygon gets smoother.
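The effect of bin width can also be tried out in a few lines of Python. The sketch below uses hypothetical values and a hypothetical helper `histogram_counts`; it simply counts how many values fall in each equal-width bin and prints a text histogram for three widths.

```python
def histogram_counts(values, start, width, n_bins):
    """Count values in n_bins equal-width bins [start + k*width, start + (k+1)*width)."""
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts

# Hypothetical data set, for illustration only.
data = [12, 15, 21, 22, 29, 30, 34, 38, 41, 49]

# Same data, three bin widths covering the range 10 to 50.
for width, n_bins in ((5, 8), (10, 4), (20, 2)):
    counts = histogram_counts(data, start=10, width=width, n_bins=n_bins)
    print(f"bin width {width}:")
    for k, c in enumerate(counts):
        lo = 10 + k * width
        print(f"  {lo:>2}-{lo + width:<2} {'#' * c}")
```

With narrow bins the shape looks ragged, and with very wide bins the detail disappears, which is why a few widths are usually compared.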
Introduction
Mean
• Technically known as the arithmetic mean and popularly known as the average, it is calculated
as:
\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
• Median is the middle number in the array of numbers in the dataset arranged in ascending
or descending order.
Eg: 3, 5, 2, 1, 6, 10, 9 (note that dataset has 7 values (odd number))
• To find the median, arrange the numbers in ascending order as:
1, 2, 3, 5, 6, 9, 10 (here the middle value of 5 is the median).
• For a dataset with an even number of values, the median is the mean of the middle two values.
Eg: 3, 5, 2, 1, 6, 9. Arranged in ascending order:
1, 2, 3, 5, 6, 9, so the median is (3+5)/2 = 4
• Mode is the most frequently occurring number in a dataset.
Eg: 1, 6, 3, 4, 3, 7, 9, 3, 11, 8, 3, 8
• The mode is: 3
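The three worked examples above can be checked with Python's standard `statistics` module. This is a minimal sketch using the datasets from the slide; `multimode` is used because, as with Excel's MODE.MULT, a dataset may have more than one mode.

```python
import statistics

odd_dataset = [3, 5, 2, 1, 6, 10, 9]      # 7 values (odd count)
even_dataset = [3, 5, 2, 1, 6, 9]         # 6 values (even count)
mode_dataset = [1, 6, 3, 4, 3, 7, 9, 3, 11, 8, 3, 8]

# median() sorts internally, so the raw order does not matter.
median_odd = statistics.median(odd_dataset)
median_even = statistics.median(even_dataset)

# multimode() returns a list with every most-frequent value.
modes = statistics.multimode(mode_dataset)

print(median_odd, median_even, modes)
```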
• Mean: affected by extreme values
• Median: not affected by extreme values
• Mode: not affected by extreme values
“Averaging” means
A quick note
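\[ \text{Geometric mean} = \left( \prod_{i=1}^{n} x_i \right)^{1/n} = \sqrt[n]{\prod_{i=1}^{n} x_i} \]
• When a dataset has many items, manual calculations get cumbersome and
logarithms may therefore be used to simplify:
\[ \ln \mathrm{GM} = \frac{\sum_{i=1}^{n} \ln x_i}{n} \quad \therefore \quad \mathrm{GM} = \exp\left( \frac{\sum_{i=1}^{n} \ln x_i}{n} \right) \]
A logarithm is the power to which a fixed number (the base) is raised to obtain a given number: \( \log_{10} 1000 = 3 \Rightarrow 10^{3} = 1000 \)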
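As a rough check that the product form and the logarithm form agree, the sketch below computes the geometric mean both ways. The input values are hypothetical, and the function name `geometric_mean_via_logs` is an assumption made for this illustration.

```python
import math
import statistics

def geometric_mean_via_logs(values):
    """GM = exp(mean of ln x_i); only defined for strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

values = [2, 4, 8]  # hypothetical data; the product is 64, so GM should be 4

gm_logs = geometric_mean_via_logs(values)
gm_product = math.prod(values) ** (1 / len(values))  # n-th root of the product
gm_stdlib = statistics.geometric_mean(values)        # standard-library version

print(gm_logs, gm_product, gm_stdlib)
```

For long datasets the logarithm form is also safer on a computer, since the raw product can overflow or underflow.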
Download the file DATA-ICE-SDS-MCT.XLSX from the cloud drive for the raw data set.

From–to   Speed (km/h)   Distance (km)   Time (h)
A–B       30             20              2/3
B–C       40             20              1/2
C–D       50             20              2/5
D–E       60             20              1/3

Q: What is the average speed of the entire journey for a car travelling from A to E?
(30+40+50+60)/4 = 45 km/h
This is WRONG!
The correct answer is:
\[ \text{Average speed} = \frac{80\ \text{km}}{\left( \frac{2}{3} + \frac{1}{2} + \frac{2}{5} + \frac{1}{3} \right)\ \text{h}} = \frac{80}{1.9} \approx 42.105\ \text{km/h} \]
Using the harmonic mean formula:
\[ \text{Average speed} = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}} = \frac{4}{\frac{1}{30} + \frac{1}{40} + \frac{1}{50} + \frac{1}{60}} = \frac{4}{0.095} \approx 42.105\ \text{km/h} \]
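The journey example can be reproduced in Python as a sanity check. The sketch below uses the speeds and the 20 km leg length from the table; exact fractions are used for the travel times so that the 1.9 h total is not affected by rounding.

```python
import statistics
from fractions import Fraction

speeds_kmh = [30, 40, 50, 60]   # speeds on legs A-B, B-C, C-D, D-E
leg_km = 20                     # every leg is 20 km long

# Time per leg = distance / speed, kept as exact fractions of an hour.
leg_times_h = [Fraction(leg_km, v) for v in speeds_kmh]
total_time_h = sum(leg_times_h)                  # 19/10 h
total_distance_km = leg_km * len(speeds_kmh)     # 80 km

average_speed = total_distance_km / total_time_h       # total distance / total time
naive_average = sum(speeds_kmh) / len(speeds_kmh)      # the WRONG answer
harmonic = statistics.harmonic_mean(speeds_kmh)        # n / sum(1/x_i)

print(float(average_speed), harmonic, naive_average)
```

The harmonic mean matches total distance over total time here because every leg covers the same distance; with unequal legs a distance-weighted version would be needed.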
Q: Find the arithmetic mean, harmonic mean, and arithmetic mean without the outlier of the salary data of 20 marketing managers.
[Bar chart: salaries of marketing managers 1 to 20; vertical axis from 0 to 120. Data labels that survive, in extraction order: 48, 50, 45, 45, 46, 50, 43, 40, 38, 43, 40, 42, 35, 37, 34, 35, 35, 40, 34; the outlier's label is not shown.]
Q2: Using the file DATA-ICE-SDS-TOLL.XLSX from the cloud drive, calculate the mean, median,
and mode (for ungrouped data).
Solution to Q2
* This is the smallest value of the mode. The data could have more than one mode.
Range
Download the file DATA-ICE-SDS-MOD.XLSX from the cloud drive for data used in this section.

Income in thousand Rs./month
Household   Neighbourhood A   Neighbourhood B
A           10                28
B           20                29
C           30                30
D           40                31
E           50                32

• The mean is the same for both datasets, ie 30. However, intuitively we can see that these
datasets have different characteristics.
• This difference can be measured using range, which is the difference between the highest
and lowest values:
\[ \text{Range} = x_{\max} - x_{\min} \]
so the range is 50 − 10 = 40 for Neighbourhood A and 32 − 28 = 4 for Neighbourhood B.
Average deviation
In such cases, the difference between the two datasets can be captured by average
deviation, which examines each value in the dataset w.r.t. the mean.
\[ \mathrm{AD} = \frac{\sum_{i=1}^{n} \left| x_i - \bar{x} \right|}{n} \]
Note: \( |x_i - \bar{x}| \) is called the modulus, which ignores the sign.
Data values (two columns, as on the slide):
30   20
30   20
30   20
30   20
30   60
50   60
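A minimal Python sketch of the average deviation formula follows. It is applied here to the Neighbourhood A and B incomes from the range example (an assumption made for illustration, since those values are fully legible); the function name `average_deviation` is likewise hypothetical.

```python
def average_deviation(values):
    """Mean of the absolute deviations |x_i - mean| (the modulus ignores the sign)."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)

# Incomes in thousand Rs./month, from the range example.
neighbourhood_a = [10, 20, 30, 40, 50]
neighbourhood_b = [28, 29, 30, 31, 32]

ad_a = average_deviation(neighbourhood_a)   # (20 + 10 + 0 + 10 + 20) / 5
ad_b = average_deviation(neighbourhood_b)   # (2 + 1 + 0 + 1 + 2) / 5

print(ad_a, ad_b)
```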
Variance
Background
• However, if we consider the sign of all the deviations from the mean, then the sum of
• To avoid this, we square the deviations from the mean, sum them, and then average
Variance
Introduction
\[ \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} \qquad s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \]
Notes:
[1] The symbols σ², s² are for population and sample variance, respectively.
[2] The symbols μ, x̄ are for population and sample means, respectively.
[3] The symbols N, n are the sizes of the population and the sample, respectively.
Note: When sampling, usually μ is not known and only x̄ is known, in which case (n − 1) is
used instead of N. This is because there are only (n − 1) independent deviations (xᵢ − x̄).
In other words, the value of any one deviation is ALWAYS equal to the negative of the sum
of the other (n − 1) deviations, because the deviations from x̄ sum to zero. This means that
there are only (n − 1) values available for calculating the mean of the squared deviations;
one observation is NOT free to vary. This concept is
known as DEGREES OF FREEDOM and is used extensively in statistics.
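The degrees-of-freedom argument can be illustrated numerically. The sketch below uses the Neighbourhood A incomes from the range example (chosen here only for illustration) to show that the deviations from the sample mean sum to zero, so the last deviation is fixed by the others, and then compares division by N with division by (n − 1).

```python
import statistics

incomes = [10, 20, 30, 40, 50]   # Neighbourhood A, thousand Rs./month
n = len(incomes)
mean = statistics.fmean(incomes)

deviations = [x - mean for x in incomes]
sum_of_deviations = sum(deviations)            # always 0 (up to rounding)
last_from_others = -sum(deviations[:-1])       # equals the last deviation

sum_sq = sum(d * d for d in deviations)
population_variance = sum_sq / n               # divide by N
sample_variance = sum_sq / (n - 1)             # divide by n - 1

print(sum_of_deviations, deviations[-1], last_from_others)
print(population_variance, statistics.pvariance(incomes))
print(sample_variance, statistics.variance(incomes))
```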
Standard deviation
Introduction
• It should be noted that the units of variance would be square of the unit of the data
• To avoid this, the square root of variance is taken giving us one of the most popular
Standard deviation
Alternative short-cut formulas
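• As the value of n increases, calculations using the formula on the previous slide tend to
become cumbersome, especially using hand-held calculators, for which the following
alternative formulas may be used:
\[ s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - \frac{\left( \sum_{i=1}^{n} x_i \right)^2}{n}}{n - 1}} \qquad s = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n - 1}} \]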
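One common short-cut form for the sample standard deviation needs only the running totals Σx and Σx². The sketch below compares it with the definition based on deviations from the mean; it assumes the sample (n − 1) version, uses the Neighbourhood A incomes as illustrative data, and the function names are hypothetical.

```python
import math

def sd_from_definition(values):
    """Sample SD from squared deviations about the mean."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))

def sd_shortcut(values):
    """Sample SD from the running totals sum(x) and sum(x^2) only."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(x * x for x in values)
    return math.sqrt((sum_x2 - sum_x ** 2 / n) / (n - 1))

incomes = [10, 20, 30, 40, 50]   # thousand Rs./month

sd_def = sd_from_definition(incomes)
sd_short = sd_shortcut(incomes)   # sqrt((5500 - 150**2 / 5) / 4)
print(sd_def, sd_short)
```

The short-cut suits calculators, but in floating-point software it can lose precision when the values are large relative to their spread, so the definition (or a library routine) is usually preferred there.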
Standard deviation
Calculations
Income in thousand Rs./month
Household   Neighbourhood A   Neighbourhood B
A           10                28
B           20                29
C           30                30
D           40                31
E           50                32

Calculate SD for Neighbourhood A?
• Population SD for Neighbourhood A = 14.14
• Sample SD for Neighbourhood A = 15.81
Calculate SD for Neighbourhood B?
• Population SD for Neighbourhood B = 1.414
• Sample SD for Neighbourhood B = 1.581
Download the file DATA-ICE-SDS-MOD.XLSX from the cloud drive for data used in this section.
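The four answers above can be verified with Python's standard `statistics` module, where `pstdev` divides by N (like Excel's STDEV.P) and `stdev` divides by n − 1 (like STDEV.S). This is a quick check, not part of the original exercise.

```python
import statistics

neighbourhood_a = [10, 20, 30, 40, 50]   # thousand Rs./month
neighbourhood_b = [28, 29, 30, 31, 32]

population_sd_a = statistics.pstdev(neighbourhood_a)   # divides by N
sample_sd_a = statistics.stdev(neighbourhood_a)        # divides by n - 1
population_sd_b = statistics.pstdev(neighbourhood_b)
sample_sd_b = statistics.stdev(neighbourhood_b)

print(f"A: population SD = {population_sd_a:.2f}, sample SD = {sample_sd_a:.2f}")
print(f"B: population SD = {population_sd_b:.3f}, sample SD = {sample_sd_b:.3f}")
```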
Coefficient of variation
• When two datasets with different units need to be compared for dispersion, the
coefficient of variation (CV) is used.
• CV is the standard deviation expressed as a percentage of the mean:
\[ \mathrm{CV} = \frac{s}{\bar{x}} \times 100\% \]
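A short sketch of the CV calculation follows. The height and weight values are hypothetical, chosen only to show two datasets in different units, and the sample standard deviation is assumed (the population version would work the same way with `pstdev`).

```python
import statistics

def coefficient_of_variation(values):
    """Sample SD as a percentage of the mean (unit-free)."""
    return statistics.stdev(values) / statistics.fmean(values) * 100

# Hypothetical measurements in different units.
heights_cm = [150, 160, 170, 180, 190]
weights_kg = [50, 60, 70, 80, 90]

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)

# Both datasets have the same SD, but relative to their means
# the weights are more dispersed than the heights.
print(f"CV heights = {cv_heights:.2f}%, CV weights = {cv_weights:.2f}%")
```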
Box plot annotations: Mean = 104.5; Median (Q2, 50th percentile) = 109.5; 1st quartile (Q1, 25th percentile) = 65.75; Minimum = 21
Note:
• If there are outliers, they are shown (in most software)
• Lower outliers are values less than (Q1−1.5IQR)
• Higher outliers are values greater than (Q3+1.5IQR)
• In that case, the whiskers end at (Q1−1.5IQR) and (Q3+1.5IQR), respectively.
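The outlier rule in the note can be sketched as below. The data are hypothetical, the helper name `box_plot_fences` is an assumption, and the `"inclusive"` quantile method is used on the understanding that it interpolates the same way as Excel's QUARTILE.INC; other software may use slightly different quartile conventions.

```python
import statistics

def box_plot_fences(values):
    """Return Q1, Q2, Q3, IQR, the two 1.5*IQR fences and any outliers."""
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    return q1, q2, q3, iqr, lower_fence, upper_fence, outliers

# Hypothetical data with one obviously extreme value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

q1, q2, q3, iqr, lower_fence, upper_fence, outliers = box_plot_fences(data)
print(f"Q1={q1}, median={q2}, Q3={q3}, IQR={iqr}")
print(f"fences: {lower_fence} to {upper_fence}; outliers: {outliers}")
```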
Kurtosis
Q3: Using the file DATA-ICE-SDS-TOLL.XLSX from the cloud drive, calculate the following:
2. Minimum, maximum, 1st quartile, 3rd quartile, range, and interquartile range (IQR)
4. Calculate the skewness and kurtosis (using Excel Data Analysis functionality or SKEW
1. Download the case study file BACS01-DS01-MBA SALARIES.PDF and the associated
=FREQUENCY(CELLREF1:CELLREF2, BINARRAY1:BINARRAY2)
=AVERAGE(CELLREF1:CELLREF2)
=MEDIAN(CELLREF1:CELLREF2)
=MODE.S(CELLREF1:CELLREF2)
=MODE.MULT(CELLREF1:CELLREF2)
=STDEV.P(CELLREF1:CELLREF2)
=STDEV.S(CELLREF1:CELLREF2)
=VAR.P(CELLREF1:CELLREF2)
=VAR.S(CELLREF1:CELLREF2)
=MIN(CELLREF1:CELLREF2)
=MAX(CELLREF1:CELLREF2)
=QUARTILE.INC(CELLREF1:CELLREF2,0)
=QUARTILE.INC(CELLREF1:CELLREF2,1)
=QUARTILE.INC(CELLREF1:CELLREF2,2)
=QUARTILE.INC(CELLREF1:CELLREF2,3)
=QUARTILE.INC(CELLREF1:CELLREF2,4)
=SKEW(CELLREF1:CELLREF2)
=KURT(CELLREF1:CELLREF2)
=ABS(CELLREF1:CELLREF2)
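For readers working outside Excel, the sketch below implements the bias-adjusted sample skewness and excess kurtosis formulas that, to the best of our understanding, are the ones behind =SKEW and =KURT. The function names are hypothetical and the test data are illustrative; results should be compared against Excel on the real dataset before being relied on.

```python
import statistics

def sample_skewness(values):
    """n / ((n-1)(n-2)) * sum(((x - mean) / s)^3), with s the sample SD."""
    n = len(values)
    if n < 3:
        raise ValueError("skewness needs at least 3 values")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)

def sample_excess_kurtosis(values):
    """Bias-adjusted excess kurtosis (0 for a normal distribution)."""
    n = len(values)
    if n < 4:
        raise ValueError("kurtosis needs at least 4 values")
    mean = statistics.fmean(values)
    s = statistics.stdev(values)
    fourth = sum(((x - mean) / s) ** 4 for x in values)
    scale = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3))
    correction = 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))
    return scale * fourth - correction

symmetric = [10, 20, 30, 40, 50]       # evenly spaced, so skewness is 0
right_skewed = [1, 2, 3, 4, 100]       # hypothetical, long right tail

skew_symmetric = sample_skewness(symmetric)
kurt_symmetric = sample_excess_kurtosis(symmetric)
skew_right = sample_skewness(right_skewed)
print(skew_symmetric, kurt_symmetric, skew_right)
```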