SB 2023 Lecture3

Session 3: Descriptive Statistics
Statistics for Business

Dr. Le Anh Tuan
1
Numerical Description
Three key characteristics of numerical data:
►Center:
►Where are the data values concentrated?
►What seem to be typical or middle data values? Is
there central tendency?
►Variability
►How much dispersion is there in the data?
►How spread out are the data values?
►Are there unusual values?
►Shape
►Are the data values distributed symmetrically?
Skewed? Sharply peaked? Flat? Bimodal?
2
Measures of Central Tendency
3
Measures of Central Tendency
► Central tendency is a single value used to describe the center
point of a data set.
Central Tendency
Mean Median Mode
Weighted
Mean
4
Mean
► The mean, or average is the most common measure of central
tendency.
► Calculate the mean by adding all the values in a data set and
then dividing the result by the number of observations.
5
Mean
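► For the population:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
where the x_i are the population values and N is the population size.
► For the sample:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where the x_i are the observed values and n is the sample size.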
6
Mean
► Suppose a sample of size n = 10 gives the following values:
4 3 8 10 0 2 3 7 5 2
► The sample mean is
$$\bar{x} = \frac{4 + 3 + 8 + 10 + 0 + 2 + 3 + 7 + 5 + 2}{10} = \frac{44}{10} = 4.4$$
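The calculation can be checked with a few lines of Python. This is a minimal sketch using the ten sample values from this slide; the variable names are illustrative only.

```python
from statistics import mean

# The ten observed values from the slide.
sample = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]

# Sum the observations and divide by the sample size n.
manual_mean = sum(sample) / len(sample)

# The standard library computes the same quantity.
library_mean = mean(sample)

print(manual_mean, library_mean)
```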
7
Mean
► Advantages:
► Easy to calculate
► Summarizes the data with a single value
► Disadvantages:
► Affected by outliers (values that are much higher or
lower than most of the data)
8
Mean
► Affected by outliers (values that are much higher or lower than
most of the data)
Without an outlier (values 1, 2, 3, 4, 5): Mean = 3
$$\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$$
With an outlier (values 1, 2, 3, 4, 10): Mean = 4
$$\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$$
9
Weighted Mean
► A weighted mean allows you to assign more weight to
certain values and less weight to others.
► Formula for the weighted mean:
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
► w_i = the weight for each data value x_i
► \sum_{i=1}^{n} w_i = the sum of all the weights.
Weighted Mean
► A GPA is a weighted average that gives greater weight to
courses worth more credits. Grades range from 0 to 4. Student A
earns 4.0 grade points in Science, which is worth 4 credits, 3.0 in
English, which is worth 3 credits, and 3.5 in Physics, which is
worth 2 credits. What is A's GPA?
► Solution:
► Multiply each grade by its weight. Add the products
and then divide by the sum of the weights.
$$\bar{x} = \frac{4 \times 4 + 3 \times 3 + 3.5 \times 2}{4 + 3 + 2} = \frac{32}{9} \approx 3.56$$
► A's grade point average is approximately 3.56.
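The weighted mean is the sum of each value times its weight, divided by the sum of the weights. The sketch below applies that to the GPA example; the helper name `weighted_mean` and the variable names are illustrative assumptions, not part of the lecture.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Grade points and credit weights for Science, English, Physics.
grade_points = [4.0, 3.0, 3.5]
credit_hours = [4, 3, 2]

gpa = weighted_mean(grade_points, credit_hours)
print(round(gpa, 2))
```

The same function works when the weights are probabilities (likelihoods), because the division by the sum of the weights leaves the result unchanged when they already add up to 1.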
11
Questions
► You invest in the stock market, where returns depend strongly
on market conditions. The details are as follows:
Stock returns
Market condition    Likelihood    Return
Up                  45%           20%
Neutral             20%           14%
Down                35%           -5%
► Calculate the expected return for this investment.
12
The Median
13
Median
► The median is the value in the data set for which half of the
observations are higher and half of the observations are
lower.
► In other words, the median (M) is the 50th percentile or
midpoint of the ordered sample data.
14
Median
► Not affected by outliers
[Figure: two dot plots on a 0 to 10 axis; surviving labels: Mean = 3, Median = 3]
15
Median
► The location of the median:
$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$
where n is the number of observations.
► If the number of values is odd, the median is the
middle number.
► If the number of values is even, the median is the
average of the two middle numbers.
► Note that (n+1)/2 is not the value of the median, only the
position of the median in the ranked data.
16
Median
► Suppose an ordered sample of size n = 9 gives the following
values:
0 3 4 7 9 12 13 17 25
► The median position is
$$\frac{9 + 1}{2} = 5$$
► The median value is the 5th ordered value, 9.
17
Median
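► Suppose an ordered sample of size n = 10 gives the following
values:
0 3 4 7 9 12 13 17 25 4000
► The median position is
$$\frac{10 + 1}{2} = 5.5$$
► The median value is the average of the 5th and 6th ordered values:
$$\frac{9 + 12}{2} = 10.5$$
► The median is not affected by the outlier (4000).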
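The position rule can be sketched in Python as follows. The function name `median_by_position` is an illustrative assumption; the two data sets are the ones used in the examples above, and the mean is printed alongside the median to show how differently the two react to the value 4000.

```python
from statistics import mean, median


def median_by_position(values):
    """Median via the (n + 1) / 2 position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("need at least one value")
    if n % 2 == 1:
        # Odd n: (n + 1) / 2 is a whole number; take that 1-based position.
        return ordered[(n + 1) // 2 - 1]
    # Even n: the position falls halfway between the two middle values.
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


odd_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25]
even_sample = odd_sample + [4000]

print(median_by_position(odd_sample), mean(odd_sample))
print(median_by_position(even_sample), mean(even_sample))
```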
18
The Mode
19
Mode
►The mode is the value that appears most often in a data
set.
►If no data value or category occurs more than once,
we say that the mode does not exist.
►More than one mode can exist if two or more values
tie for the most frequent.
►The mode is most useful for discrete or categorical data
with only a few distinct data values. For continuous data or
data with a wide range, the mode is rarely useful.
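A minimal Python sketch of these rules follows. The function name `modes` and the small data sets are hypothetical; the function returns an empty list when no value occurs more than once, and every tied value when several share the highest frequency.

```python
from collections import Counter


def modes(values):
    """Return the list of modes, or [] if no value occurs more than once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # Every value appears exactly once: the mode does not exist.
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([0, 1, 2, 3, 4, 5, 6]))   # no repeated value
print(modes([1, 3, 9, 9, 12, 14]))    # a single mode
print(modes([1, 2, 2, 3, 3]))         # two values tie
```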
20
Mode
[Figure: dot plot on a 0 to 6 axis with no repeated value: No Mode]
[Figure: dot plot on a 0 to 14 axis: Mode = 9]
21
Questions
► Which measure of location is the “best”: the mean, the median, or the mode?
22
Shape of a Distribution
►Describes how data are distributed
►Measures of shape
►Symmetric
►Skewed (Left or Right)
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean
23
Measures of Variability
24
Frequency Distribution
Variation
Measures: Range, Variance, Standard deviation
►Measures of variation give information on the spread
or variability of the data values.
[Figure: two distributions with the same center but different variation]
25
Range
►Simplest measure of variation
►Difference between the largest and the smallest
observations.
Range = Highest Value – Lowest Value
Example (data plotted on a 0 to 14 axis, smallest value 1, largest value 14):
Range = 14 - 1 = 13
26
Range
►Advantages
►Easy to calculate and understand
►Disadvantages
►Based on only two numbers in the data set; ignores
the way in which the data are distributed
►Sensitive to outliers
27
Variance
►Sample variance is denoted by s²:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$$
►where x̄ = sample mean
►n = sample size
►x_i = ith value of the variable X
28
Variance
►A sample includes 8 observations:
10 12 14 15 17 18 18 24
n = 8, Mean x̄ = 16
►Variance:
$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} \approx 18.57$$
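The sample variance can be verified step by step. In this sketch the eight observations are taken from the slide, and the value 18 is counted twice because it appears twice in the data; the sum of squared deviations then comes to 130, so the variance is 130 / 7, about 18.57. Variable names are illustrative.

```python
from statistics import variance

sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)
x_bar = sum(sample) / n

# Squared distance of every observation from the sample mean.
squared_deviations = [(x - x_bar) ** 2 for x in sample]

# Sample variance divides by n - 1, not n.
s2 = sum(squared_deviations) / (n - 1)

print(x_bar, sum(squared_deviations), round(s2, 2))
print(round(variance(sample), 2))  # standard-library cross-check
```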
29
Variance
►Population variance is denoted by σ²:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$$
►where μ = population mean
►N = population size
►x_i = ith value of the variable X
30
Standard Deviations
►Standard deviation is the square root of the variance
►Has the same units as the original data
►Sample standard deviation:
$$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
31
Standard deviation
►A sample includes 8 observations:
10 12 14 15 17 18 18 24
n = 8, Mean x̄ = 16
►Variance:
$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} \approx 18.57$$
►Standard deviation:
$$s = \sqrt{18.57} \approx 4.31$$
A measure of how far, on average, each data value is
from the mean of the sample.
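Python's standard library distinguishes the sample and population versions of the standard deviation. The sketch below applies both to the eight observations; `stdev` divides by n - 1 and `pstdev` divides by N, so the sample figure is slightly larger.

```python
from statistics import pstdev, stdev

sample = [10, 12, 14, 15, 17, 18, 18, 24]

s = stdev(sample)       # sample standard deviation, divisor n - 1
sigma = pstdev(sample)  # population standard deviation, divisor N

print(round(s, 2), round(sigma, 2))
```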
32
Comparing SD
33
Excel and STATA
► Excel Formulas
► Mean: =AVERAGE(Data Values)
► Weighted Mean: =SUMPRODUCT(Data Values, Weights)/SUM(Weights)
► Median: =MEDIAN(Data Values)
► Mode: =MODE(Data Values)
► Variance: =VAR.S(Data Values)
► SD: =STDEV.S(Data Values)
34
Excel
► Excel Tools: Data/Data Analysis/Descriptive Statistics
35
Excel
36
STATA
► Use the command “sum” (short for “summarize”)
37
Using the Mean and Standard
Deviation Together
38
Coefficient of Variation
►The standard deviation is affected by the scale of the data
►When sample means are very different, comparing
SDs can be misleading.
►The coefficient of variation, CV, measures the SD as a
percentage of the mean.
►A high CV indicates high variability relative to the
size of the mean.
►A low CV indicates low variability relative to the size
of the mean.
►A smaller coefficient of variation indicates more
consistency within a set of data values.
39
Coefficient of Variation
►Sample coefficient of variation
$$CV = \frac{s}{\bar{x}} (100)$$
s = the sample standard deviation
x̄ = the sample mean
►Population coefficient of variation
$$CV = \frac{\sigma}{\mu} (100)$$
σ = the population standard deviation
μ = the population mean
40
Coefficient of Variation example
►Stock market
        Price for Stock A      Price for Stock B
Mean    100                    60
SD      20                     15
CV      = 20/100*100 = 20%     = 15/60*100 = 25%
►Although stock A has a larger standard deviation, its price is more
consistent relative to its mean (lower CV).
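The comparison of the two stocks can be reproduced in a few lines. This is a sketch only; the function name `coefficient_of_variation` is an illustrative assumption, and the means and standard deviations are the figures given for stocks A and B.

```python
def coefficient_of_variation(sd, mean_value):
    """Standard deviation expressed as a percentage of the mean."""
    if mean_value == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean_value * 100


cv_a = coefficient_of_variation(sd=20, mean_value=100)
cv_b = coefficient_of_variation(sd=15, mean_value=60)

# Stock A has the larger SD but the smaller CV.
print(cv_a, cv_b)
```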
41
Z-score
►A z-score identifies the number of standard deviations a
particular value is from the mean of its distribution.
►A z-score has no unit.
►A z-score is
►Zero for values equal to the mean
►Positive for values above the mean
►Negative for values below the mean
42
Z-score formula
►Sample z-score
$$z = \frac{x - \bar{x}}{s}$$
x = the data value of interest
s = the sample standard deviation
x̄ = the sample mean
►Population z-score
$$z = \frac{x - \mu}{\sigma}$$
σ = the population standard deviation
μ = the population mean
43
Unusual observations
►Based on its standardized z-score, a data value is classified
as:
►Unusual if |z| > 2 (beyond μ ± 2σ)
►Outlier if |z| > 3 (beyond μ ± 3σ)
44
Z-score example
►Price for a glass of milk tea (size L) in Vietnam
15K 28K 43K 50K 70K 90K
►Average price: Mean = 49.33
►SD = 27.41
►How far is the price of KOI THÉ (90K) from the sample
mean of 49.33 (in SD increments)?
45
Z-score example
►Price for a glass of milk tea (size L) in Vietnam
15K 28K 43K 50K 70K 90K
$$z_{KOI} = \frac{90 - 49.33}{27.41} \approx 1.48$$
►The price of KOI is more than one standard deviation (1.48)
above the sample mean.
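The z-score for the 90K price can be recomputed from the six prices. The sketch below also applies the |z| > 2 and |z| > 3 cut-offs; the function name `classify` and its labels are illustrative assumptions.

```python
from statistics import mean, stdev

# Prices in thousands of VND for a size L milk tea.
prices = [15, 28, 43, 50, 70, 90]

x_bar = mean(prices)
s = stdev(prices)  # sample standard deviation (divisor n - 1)


def z_score(x):
    """Number of sample standard deviations x lies from the sample mean."""
    return (x - x_bar) / s


def classify(z):
    """Label a value using the |z| > 2 and |z| > 3 cut-offs."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "unusual"
    return "typical"


z_koi = z_score(90)
print(round(x_bar, 2), round(s, 2), round(z_koi, 2), classify(z_koi))
```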
46
Empirical rule
►If the data distribution is a bell-shaped, symmetrical curve
centered around the mean, we would expect:
►About 68% of the values in the population to fall within ±
1 standard deviation from the mean (μ ± 1σ)
►About 95% of the values in the population to fall within ±
2 standard deviations from the mean (μ ± 2σ)
►About 99.7% of the values in the population to fall within ±
3 standard deviations from the mean (μ ± 3σ)
47
Chebychev’s Theorem
► For any population with mean μ and standard deviation σ, and k >
1, the percentage of observations that fall within the interval
$$[\mu \pm k\sigma]$$
► is at least
$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$
► Regardless of how the data are distributed, at least (1 - 1/k²) of the
values will fall within k standard deviations of the mean (for k > 1)
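The bound (1 - 1/k²) × 100% is easy to tabulate. A minimal sketch, with an illustrative function name, shows the guaranteed minimum percentages for k = 2 and k = 3, which are lower than the 95% and 99.7% expected for bell-shaped data because the bound holds for any distribution.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k**2) * 100


for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 1))
```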
48
Grouped Data
49
Grouped Data
► Suppose data are grouped into K classes, with frequencies f1, f2,
. . ., fK, and the midpoints of the classes are m1, m2, . . ., mK.
► For a sample of n observations, the mean is
$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$$
n: the total number of observations
K: the number of classes
► This mean is only an approximate value since the midpoint is just
an estimate of the value in each class.
50
Grouped Data
► The following table gives the frequency distribution of the
number of orders received each morning during the past 30
mornings at a coffee store. Calculate the mean.
Number of orders    Frequency (f)
0 - 4               4
5 - 9               5
10 - 14             10
15 - 19             7
20 - 24             4
                    n = 30
51
Grouped Data
► Calculate the mean.
► m is the midpoint of the class: add the two class limits and
divide by 2.
Number of orders    Frequency (f)    Midpoint (m)    fm
0 - 4               4                2               8
5 - 9               5                7               35
10 - 14             10               12              120
15 - 19             7                17              119
20 - 24             4                22              88
                    n = 30                           370
$$\bar{x} = \frac{\sum f m}{n} = \frac{370}{30} \approx 12.33$$
► The average number of orders is about 12.33
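The grouped-data mean can be computed directly from the class limits and frequencies. In this sketch each class is a (lower limit, upper limit, frequency) tuple taken from the coffee store table; the variable names are illustrative.

```python
# (lower limit, upper limit, frequency) for each class of orders.
classes = [(0, 4, 4), (5, 9, 5), (10, 14, 10), (15, 19, 7), (20, 24, 4)]

# Midpoint of a class: add the class limits and divide by 2.
midpoints = [(low + high) / 2 for low, high, _ in classes]
frequencies = [f for _, _, f in classes]

n = sum(frequencies)
sum_fm = sum(f * m for f, m in zip(frequencies, midpoints))
grouped_mean = sum_fm / n

print(n, sum_fm, round(grouped_mean, 2))
```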
52
Grouped Data
► For a sample of n observations, the variance is
$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$$
53
Grouped Data
► Calculate the variance.
Number of orders    Frequency (f)    Midpoint (m)    m − x̄     (m − x̄)²    f(m − x̄)²
0 - 4               4                2               -10.33    106.71      426.84
5 - 9               5                7               -5.33     28.41       142.04
10 - 14             10               12              -0.33     0.11        1.09
15 - 19             7                17              4.67      21.81       152.66
20 - 24             4                22              9.67      93.51       374.04
                    n = 30                                                 1096.67
► Variance:
$$s^2 = \frac{1096.67}{30 - 1} \approx 37.82$$
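The grouped-data variance follows the same pattern: weight each squared deviation of a class midpoint from the grouped mean by the class frequency, then divide by n - 1. A minimal sketch using the same table (variable names are illustrative):

```python
# Class midpoints and frequencies from the coffee store table.
midpoints = [2, 7, 12, 17, 22]
frequencies = [4, 5, 10, 7, 4]

n = sum(frequencies)
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Frequency-weighted sum of squared deviations from the grouped mean.
weighted_squares = sum(
    f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints)
)

grouped_variance = weighted_squares / (n - 1)

print(round(weighted_squares, 2), round(grouped_variance, 2))
```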
54
Relative Position
Percentiles, Quartiles, and Box Plots
55
Measures of relative position
► Measures of relative position compare the position of one
value in relation to other values in the data set.
► Measures:
► Percentiles
► Quartiles
► Interquartile range
56
Percentiles
► Percentiles divide the ordered data into 100 groups.
► For example, suppose your EQ score is at the 80th percentile on a
standardized test. That means that 80% of the test-takers
scored below you.
► Generally, the pth percentile of a data set (where p is any
number between 1 and 100) is the value such that approximately p
percent of the observations fall below it.
57
Percentiles
58
Percentiles
► To find percentiles manually:
► Sort the data from lowest to highest
► Calculate the position, i
$$i = \frac{p}{100} (n)$$
where p = the percentile of interest
n = number of data values.
► If i is not a whole number, round i up to the next whole
number; the value in the ith position is our value of interest.
► If i is a whole number, the midpoint between the values in the ith and
(i+1)th positions is our value of interest.
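The manual procedure can be sketched as a small function. This follows the position rule i = (p/100)·n described on this slide, which is only one of several percentile conventions, so spreadsheet or statistics packages may return slightly different values. The function name `percentile` and the guard for p = 100 are illustrative assumptions; the data are the ordered samples used in the median examples.

```python
import math


def percentile(values, p):
    """pth percentile using the position rule i = (p / 100) * n."""
    if not 0 < p <= 100:
        raise ValueError("p must be greater than 0 and at most 100")
    ordered = sorted(values)       # step 1: sort lowest to highest
    n = len(ordered)
    i = p * n / 100                # step 2: position
    if i != int(i):
        # Not a whole number: round up and take that (1-based) position.
        return ordered[math.ceil(i) - 1]
    i = int(i)
    if i == n:
        # Assumption: at p = 100 there is no (i+1)th value; use the maximum.
        return ordered[-1]
    # Whole number: midpoint between the ith and (i+1)th values.
    return (ordered[i - 1] + ordered[i]) / 2


nine_values = [0, 3, 4, 7, 9, 12, 13, 17, 25]
ten_values = nine_values + [4000]

print(percentile(nine_values, 50))  # i = 4.5, so the 5th value
print(percentile(ten_values, 50))   # i = 5, midpoint of 5th and 6th values
print(percentile(nine_values, 25))  # i = 2.25, so the 3rd value
```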
59
Quartiles
► Quartiles are scale points that divide the sorted data into four
groups of approximately equal size.
► The first quartile, Q1, is the value for which 25% of the observations are
smaller and 75% are larger.
► Q2 is the same as the median (50% are smaller, 50% are larger)
► Only 25% of the observations are greater than the third quartile.
Questions?
► What is the median of the data values below Q2?
► What is the median of the data values above Q2?
Interquartile Range
► The interquartile range describes the middle 50% of the data.
► It is not affected by extremely high- or low-valued observations (outliers)
► Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1
Box-and-Whisker plot
► A box plot (also called a box-and-whisker plot) is a graphical
display showing the relative position of the five-number
summary:
Min , Q1 , Q2 , Q3 , Max
► It also shows outliers, if any
[Figure: box plot with outliers marked by asterisks]
Outliers
►Formulas for the upper and lower limits of outliers
►Upper Limit = Q3 + 1.5 × IQR
►Lower Limit = Q1 - 1.5 × IQR
►Values beyond these limits are considered outliers
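The 1.5 × IQR rule translates into a short function. In this sketch the quartiles are passed in directly, since the way Q1 and Q3 are computed varies between textbooks and software; the function names, the quartile values, and the data are all hypothetical.

```python
def outlier_limits(q1, q3):
    """Return (lower limit, upper limit) from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def find_outliers(values, q1, q3):
    """Values that fall beyond the lower or upper limit."""
    lower, upper = outlier_limits(q1, q3)
    return [x for x in values if x < lower or x > upper]


# Hypothetical quartiles and data, for illustration only.
lower, upper = outlier_limits(q1=10, q3=20)
flagged = find_outliers([-8, 3, 12, 15, 19, 34, 40], q1=10, q3=20)

print(lower, upper, flagged)
```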
64
Examples
65
STATA
[Figure: Stata box plot of ROA, axis from -0.4 to 0.4]
66
Association Between Two Variables
(Covariance , Correlation)
67
Covariance
► The covariance measures the direction of the linear relationship between two
variables.
► The population covariance:
$$\text{Cov}(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$
► The sample covariance:
$$\text{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
► Only concerned with the direction of the relationship (positive, negative, no
relationship)
► No causal effect is implied
68
Covariance
► Covariance between two variables:
► Cov(x,y) > 0 x and y tend to move in the same direction
► Cov(x,y) < 0 x and y tend to move in opposite directions
► Cov(x,y) = 0 x and y have no linear relationship (they are not necessarily independent)
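A covariance of zero rules out only a linear relationship. The sketch below uses hypothetical data in which y is completely determined by x (y = x²), yet the sample covariance is exactly zero; the function name `sample_covariance` is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of (x - x_bar)(y - y_bar) over the pairs, divided by n - 1."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical data: y depends on x, but not linearly.
xs = [-2, -1, 0, 1, 2]
ys = [x**2 for x in xs]

cov_xy = sample_covariance(xs, ys)
print(cov_xy)
```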
69
Correlation Coefficient
► The sample correlation coefficient, r_xy, measures both the strength and
direction of the linear relationship between two variables.
► Formula for population correlation coefficient:
$$\rho = \frac{\text{Cov}(x, y)}{\sigma_x \sigma_y}$$
► Formula for sample correlation coefficient:
$$r = \frac{\text{Cov}(x, y)}{s_x s_y}$$
where s_x and s_y are the sample standard deviations of the x and y variables
(σ_x and σ_y for the population).
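The sample correlation coefficient is the sample covariance divided by the product of the two sample standard deviations. A minimal sketch on hypothetical paired data (function and variable names are illustrative):

```python
from statistics import mean, stdev


def sample_correlation(xs, ys):
    """r = Cov(x, y) / (s_x * s_y), using the n - 1 divisor throughout."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with n >= 2")
    x_bar, y_bar = mean(xs), mean(ys)
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    return cov / (stdev(xs) * stdev(ys))


# Hypothetical paired observations, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r = sample_correlation(xs, ys)
print(round(r, 4))
```

Because both the numerator and the denominator carry the units of x times the units of y, the ratio is unit free and always lies between -1 and 1.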
70
Correlation Coefficients
► Unit free
► Ranges between –1 and 1
► The closer to –1, the stronger the negative linear relationship
► The closer to 1, the stronger the positive linear relationship
► The closer to 0, the weaker the linear relationship (positive or negative)
71
Correlation Coefficients
72
Excel
► Covariance for the sample:
=COVARIANCE.S(X DATA VALUES, Y DATA VALUES)
► Correlation for the sample:
=CORREL(X DATA VALUES, Y DATA VALUES)
► Excel Tools: Data/Data Analysis/Covariance
Data/Data Analysis/Correlation
73
Skewness
$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
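A sketch of the sample skewness statistic, assuming the adjusted form n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, with s the sample standard deviation. The function name is illustrative; the two data sets are the symmetric and outlier examples from the slides on the mean.

```python
from statistics import mean, stdev


def sample_skewness(values):
    """n / ((n-1)(n-2)) * sum(((x - x_bar) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    x_bar = mean(values)
    s = stdev(values)
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed


symmetric = [1, 2, 3, 4, 5]
right_skewed = [1, 2, 3, 4, 10]

print(sample_skewness(symmetric), sample_skewness(right_skewed))
```

A value near zero suggests symmetry, a positive value a longer right tail, and a negative value a longer left tail.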
74
Kurtosis
► Kurtosis refers to the relative length of the tails and the degree of
concentration in the center.
► A normal bell-shaped population is called mesokurtic and serves
as a benchmark.
► A population that is flatter than a normal population (i.e., has
lighter tails) is called platykurtic, while one that is more sharply
peaked than a normal population (i.e., has heavier tails) is
leptokurtic.
► Kurtosis is not the same thing as variability, although the two are
easily confused.
► A histogram is an unreliable guide to kurtosis because its scale
and axis proportions may vary, so a numerical statistic is needed:
75
Kurtosis
76
Exercise
► Review Session 3, Online Quiz 3.
► Homework, Group Assignment
► Reading Chapter 4. Probability
77
