SB 2023 Lecture3
1
Numerical Description
Three key characteristics of numerical data:
►Center:
►Where are the data values concentrated?
►What seem to be typical or middle data values? Is
there central tendency?
►Variability
►How much dispersion is there in the data?
►How spread out are the data values?
►Are there unusual values?
►Shape
►Are the data values distributed symmetrically?
Skewed? Sharply peaked? Flat? Bimodal?
2
Measures of Central Tendency
3
Measures of Central Tendency
► Central tendency is a single value used to describe the center
point of a data set.
[Diagram: measures of central tendency; surviving label: Weighted Mean]
4
Mean
► The mean, or average, is the most common measure of central
tendency.
► Calculate the mean by adding all the values in a data set and
then dividing the result by the number of observations.
5
Mean
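► For the population:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
where $x_1, \ldots, x_N$ are the population values and N is the population size.
► For the sample:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
where $x_1, \ldots, x_n$ are the observed values and n is the sample size.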
6
Mean
► Suppose a sample of size n = 10 gives the following values:
4 3 8 10 0 2 3 7 5 2
$\bar{x} = \frac{4 + 3 + 8 + 10 + 0 + 2 + 3 + 7 + 5 + 2}{10} = \frac{44}{10} = 4.4$
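The sample mean on this slide can be checked with a few lines of Python. This is a minimal sketch; the function name `mean` and the variable names are illustrative choices, not part of the lecture.

```python
# Sample from the slide: n = 10 observed values.
sample = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]


def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


sample_mean = mean(sample)
print(sample_mean)  # 4.4
```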
7
Mean
► Advantages:
► Easy to calculate
► Summarizes the data with a single value
► Disadvantages:
► Affected by outliers (values that are much higher or
lower than most of the data)
8
Mean
► Affected by outliers (values that are much higher or lower than
most of the data)
[Figure: two dot plots on a 0-10 axis]
Values 1, 2, 3, 4, 10: Mean = (1 + 2 + 3 + 4 + 10)/5 = 20/5 = 4
Values 1, 2, 3, 4, 5: Mean = (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3
9
Weighted Mean
► A weighted mean allows you to assign more weight to
certain values and less weight to others.
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$
= 3.56
► A’s grade point average is approximately 3.56.
11
Questions
► You invest in the stock market, where returns depend strongly
on market conditions. The details are as follows:
Stock returns
Market condition   Likelihood   Return
Up                 45%          20%
Neutral            20%          14%
Down               35%          -5%
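The slide does not state what is being asked. Assuming the question is the expected return, it is a weighted mean of the three returns with the likelihoods as weights. The sketch below uses hypothetical names (`scenarios`, `weighted_mean`) and only the figures from the table.

```python
# (market condition, likelihood, return in percent), copied from the table.
scenarios = [("Up", 0.45, 20.0), ("Neutral", 0.20, 14.0), ("Down", 0.35, -5.0)]


def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


returns = [ret for _, _, ret in scenarios]
likelihoods = [p for _, p, _ in scenarios]

expected_return = weighted_mean(returns, likelihoods)
print(round(expected_return, 2))  # 10.05 (percent)
```

Dividing by the sum of the weights keeps the function correct even when the weights do not add up to 1, for example credit hours in a grade point average.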
12
The Median
13
Median
► The median is the value in the data set for which half the
observations are higher and half the observations are
lower.
14
Median
► Not Affected by outliers
[Figure: two dot plots on a 0-10 axis; surviving labels: Mean = 3, Median = 3]
15
Median
► The location of the median:
Median position = $\frac{n + 1}{2}$ position in the ordered data,
where n is the number of observations.
► Note that (n+1)/2 is not the value of the median, only the
position of the median in the ranked data.
16
Median
► Suppose a sample of size n = 9, sorted in ascending order, gives the
following values:
0 3 4 7 9 12 13 17 25
Median position = $\frac{9 + 1}{2} = 5$, so the median is the 5th ordered value, 9.
17
Median
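► Suppose a sample of size n = 10, sorted in ascending order, gives the
following values:
0 3 4 7 9 12 13 17 25 4000
Median position = $\frac{10 + 1}{2} = 5.5$, so the median is the average of the 5th and 6th ordered values: (9 + 12)/2 = 10.5.
The median is not affected by the outlier 4000.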
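The position rule (n + 1)/2 can be turned into a short function. The sketch below is illustrative: it averages the two neighbouring ordered values when the position is not a whole number, and the helper name `median` is an assumption rather than something from the slides.

```python
def median(values):
    """Median via the (n + 1) / 2 position rule on the ordered data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position in the ordered data
    lower = ordered[int(position) - 1]   # value at the floor of the position
    if position.is_integer():
        return lower
    upper = ordered[int(position)]       # next ordered value
    return (lower + upper) / 2


odd_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25]
even_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]

print(median(odd_sample))                   # 9
print(median(even_sample))                  # 10.5
print(sum(even_sample) / len(even_sample))  # 409.0
```

Adding the value 4000 moves the median only from 9 to 10.5, while the mean of the same ten values jumps to 409.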
18
The Mode
19
Mode
20
Mode
[Figure: dot plot on a 0-6 axis] No Mode
[Figure: dot plot on a 0-14 axis] Mode = 9
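The two cases in the figures (no mode, and a single mode of 9) can be sketched in Python by counting how often each value occurs. The data sets below are hypothetical, since the plotted values are not recoverable from the slide, and treating "every value occurs once" as "no mode" is the convention assumed here.

```python
from collections import Counter


def modes(values):
    """Return the most frequent value(s), or [] when no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []  # every value occurs exactly once: no mode
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([0, 1, 2, 3, 4, 5, 6]))     # []
print(modes([1, 5, 9, 9, 9, 12, 14]))   # [9]
```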
21
Questions
22
Shape of a Distribution
►Describes how data are distributed
►Measures of shape
►Symmetric
►Skewed (Left or Right)
Left-skewed: Mean < Median    Symmetric: Mean = Median    Right-skewed: Median < Mean
23
Measures of Variability
24
Frequency Distribution
Variation
[Figure: dot plot on a 0-14 axis]
Range = largest value - smallest value = 14 - 1 = 13
26
Range
►Advantages
►Easy to calculate and understand
►Disadvantages
►Based on only two numbers in the data set; ignores
how the data are distributed
►Sensitive to outliers
27
Variance
►Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
►n = sample size
28
Variance
►Variance (sample of n = 8 values with mean $\bar{x} = 16$):
► $s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8-1} = \frac{130}{7} \approx 18.57$
29
Variance
►Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
►N = population size
30
Standard Deviations
►Standard deviation is the square root of variance
31
Standard deviation
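►Variance (sample of n = 8 values with mean $\bar{x} = 16$):
► $s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8-1} = \frac{130}{7} \approx 18.57$
Standard deviation: $s = \sqrt{18.57} \approx 4.31$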
A measure of how far on average each data value is
from the mean of the sample.
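As an illustration, the sample variance and standard deviation can be computed for the ten values used on the earlier mean slide (4 3 8 10 0 2 3 7 5 2). This is a sketch under the usual definitions (n - 1 in the denominator for a sample, N for a population); the function names are assumptions.

```python
import math

# Sample reused from the mean example earlier in the lecture.
sample = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def population_variance(values):
    """Same sum of squared deviations, divided by N."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / n


s2 = sample_variance(sample)
s = math.sqrt(s2)  # standard deviation is the square root of the variance
print(round(s2, 2), round(s, 2))  # 9.6 3.1
```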
32
Comparing SD
33
Excel and STATA
► Excel Formulas
► Mean: =AVERAGE(Data Values)
► Weighted Mean: =SUMPRODUCT(X1, X2)/SUM(X2), where X1 holds the values and X2 the weights
► Median: =MEDIAN(Data Values)
► Mode: =MODE(Data Values)
► Variance: =VAR.S(Data Values)
► SD: =STDEV.S(Data Values)
34
Excel
► Excel Tools: Data/Data Analysis/Descriptive Statistics
35
Excel
36
STATA
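► Use the command “summarize” (abbreviated “sum”); adding the “detail” option also reports percentiles, variance, skewness, and kurtosis.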
37
Using the Mean and Standard
Deviation Together
38
Coefficient of Variation
39
Coefficient of Variation
$CV = \frac{s}{\bar{x}}(100)$
40
Coefficient of Variation example
►Stock market (two stocks compared)
        Stock 1               Stock 2
Mean    100                   60
SD      20                    15
CV      20/100*100 = 20%      15/60*100 = 25%
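The two coefficients of variation in this example can be reproduced with a one-line function. A minimal sketch; the function name is an illustrative choice.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100 * std_dev / mean


cv_first = coefficient_of_variation(20, 100)
cv_second = coefficient_of_variation(15, 60)
print(cv_first, cv_second)  # 20.0 25.0
```

The second stock has the smaller standard deviation but the larger CV, so it is the more variable of the two relative to its mean.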
41
Z-score
►The z-score is
►Zero for values equal to the mean
►Positive for values above the mean
►Negative for values below the mean
42
Z-score formula
►Sample z-score
$z = \frac{x - \bar{x}}{s}$
x = the data value of interest
s = the sample standard deviation
$\bar{x}$ = the sample mean
►Population z-score
$z = \frac{x - \mu}{\sigma}$
σ = the population standard deviation
μ = the population mean
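A short sketch of the sample z-score, applied to the ten values from the earlier mean example (an illustrative choice of data, not one made on this slide). The function name `sample_z_score` is hypothetical.

```python
import math

sample = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]


def sample_z_score(x, values):
    """(x - sample mean) / sample standard deviation."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return (x - mean) / s


z_largest = sample_z_score(10, sample)   # above the mean: positive
z_smallest = sample_z_score(0, sample)   # below the mean: negative
z_at_mean = sample_z_score(4.4, sample)  # equal to the mean: zero
print(round(z_largest, 2), round(z_smallest, 2), z_at_mean)
```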
43
Unusual observations
44
Z-score example
►Price for a glass of milk tea (size L) in Vietnam
45
Z-score example
►A price for a glass of milk tea (size L) in Vietnam
46
Empirical rule
[Figure: bell-shaped curve centered at μ with the intervals μ ± 1σ, μ ± 2σ, and μ ± 3σ marked]
About 95% of the values in the population fall within ± 2 standard
deviations of the mean.
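For a normal (bell-shaped) population, the share of values within k standard deviations of the mean can be computed directly with the standard library's `statistics.NormalDist`. This sketch assumes an exactly normal distribution, which real data only approximate.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)

# Share of a normal population lying within k standard deviations of the mean.
shares = {
    k: standard_normal.cdf(k) - standard_normal.cdf(-k)
    for k in (1, 2, 3)
}

for k, share in shares.items():
    print(f"within {k} sd: {100 * share:.1f}%")
# within 1 sd: 68.3%
# within 2 sd: 95.4%
# within 3 sd: 99.7%
```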
47
Chebychev’s Theorem
► For any population with mean μ and standard deviation σ, and any k >
1, the percentage of observations that fall within the interval
$[\mu \pm k\sigma]$
► is at least
$\left(1 - \frac{1}{k^2}\right) \times 100$
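The bound is easy to tabulate for a few values of k. A minimal sketch; the function name is an assumption.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k**2) * 100


print(chebyshev_lower_bound(2))            # 75.0
print(round(chebyshev_lower_bound(3), 1))  # 88.9
```

Unlike the empirical rule, this bound holds for any distribution shape, which is why it is weaker: at least 75% within two standard deviations rather than about 95%.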
48
Grouped Data
49
Grouped Data
► Suppose data are grouped into K classes, with frequencies f1, f2,
. . . fK, and the midpoints of the classes are m1, m2, . . ., mK.
► For a sample of n observations, the mean is $\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$
50
Grouped Data
51
Grouped Data
► Calculate the mean.
► m is the midpoint of the class: add the lower and upper class limits and
divide by 2.
Number of orders   Frequency (f)   Midpoint (m)   fm
0 - 4              4               2              8
5 - 9              5               7              35
10 - 14            10              12             120
15 - 19            7               17             119
20 - 24            4               22             88
Total              n = 30                         370
► Mean $\bar{x} = \frac{\sum fm}{n} = \frac{370}{30} = 12.33$
► The average number of orders is about 12.33
52
Grouped Data
53
Grouped Data
► Calculate the variance.
► Variance $s^2 = \frac{1096.67}{30 - 1} = 37.82$
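The grouped-data mean and variance can be reproduced from the class limits and frequencies in the orders table. The sketch assumes the usual grouped sample variance, the sum of f(m - x̄)² divided by n - 1; the variable names are illustrative.

```python
# ((lower limit, upper limit), frequency) for each class in the orders table.
classes = [((0, 4), 4), ((5, 9), 5), ((10, 14), 10), ((15, 19), 7), ((20, 24), 4)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]
frequencies = [f for _, f in classes]
n = sum(frequencies)

# Grouped mean: each midpoint is weighted by its class frequency.
grouped_mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n

# Grouped sample variance: frequency-weighted squared deviations over n - 1.
sum_sq = sum(f * (m - grouped_mean) ** 2 for f, m in zip(frequencies, midpoints))
grouped_variance = sum_sq / (n - 1)

print(n, round(grouped_mean, 2), round(sum_sq, 2), round(grouped_variance, 2))
# 30 12.33 1096.67 37.82
```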
54
Relative Position
Percentiles, Quartiles, and Box Plots
55
Measures of relative position
► Measures:
► Percentiles
► Quartiles
► Interquartile range
56
Percentiles
57
Percentiles
58
Percentiles
59
Quartiles
► Quartiles are scale points that divide the sorted data into four
groups of approximately equal size.
► The first quartile, Q1, is the value for which 25% of the observations are
smaller and 75% are larger.
► Q2 is the same as the median (50% are smaller, 50% are larger)
► Only 25% of the observations are greater than the third quartile.
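Quartiles can be computed with `statistics.quantiles` from the Python standard library. Several interpolation conventions exist and the lecture does not say which one it uses; the sketch below assumes the "exclusive" method, which places the p-th quantile at position p(n + 1) in the ordered data, and reuses the nine sorted values from the median example.

```python
from statistics import quantiles

ordered = [0, 3, 4, 7, 9, 12, 13, 17, 25]

# n=4 asks for the three cut points that split the data into four groups.
q1, q2, q3 = quantiles(ordered, n=4, method="exclusive")
print(q1, q2, q3)  # 3.5 9.0 15.0
```

Here Q2 equals the median found earlier (9), and Q1 and Q3 fall halfway between neighbouring ordered values.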
Questions?
► What is the median of the data values below Q2?
► What is the median of the data values above Q3?
Interquartile
Outliers
Outliers
►Formulas for the upper and lower limits of outliers
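The formulas themselves did not survive on this slide. A common convention, assumed here rather than taken from the lecture, flags values below Q1 - 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 - Q1. The sketch applies it to the ten values from the median example; the quartile method ("exclusive") is also an assumption.

```python
from statistics import quantiles


def outlier_limits(values, multiplier=1.5):
    """Lower and upper limits using the (assumed) 1.5 x IQR convention."""
    q1, _, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1  # interquartile range
    return q1 - multiplier * iqr, q3 + multiplier * iqr


data = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]
lower, upper = outlier_limits(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)  # -19.125 41.875 [4000]
```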
64
Examples
65
STATA
[Figure: Stata plot of ROA; axis from -.4 to .4]
66
Association Between Two Variables
(Covariance , Correlation)
67
Covariance
► The covariance measures the direction of the linear relationship between two
variables.
► The sample covariance:
$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
► Only concerned with the direction of the relationship (positive, negative, no
relationship)
► No causal effect is implied
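A minimal sketch of the sample covariance, written out by hand so each step of the formula is visible. The paired data are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
def sample_covariance(xs, ys):
    """Sum of products of paired deviations from the means, over n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_xy = sample_covariance(x, y)
print(cov_xy)  # 1.5
```

The positive sign indicates that x and y tend to move in the same direction; the magnitude depends on the units of both variables and is not a measure of strength.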
68
Covariance
69
Correlation Coefficient
► The sample correlation coefficient, $r_{xy}$, measures both the strength and
direction of the linear relationship between two variables.
► Formula for sample correlation coefficient:
$r = \frac{\mathrm{Cov}(x, y)}{s_X s_Y}$
where $s_X$ and $s_Y$ are the standard deviations of the x and y variables.
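The correlation coefficient divides the covariance by the product of the two sample standard deviations. The sketch below is self-contained and uses hypothetical paired data; the function name is illustrative.

```python
import math


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample SDs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)


# Hypothetical paired observations.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sample_correlation(x, y)
print(round(r, 3))  # 0.775

# Rescaling a variable (a change of units) leaves r unchanged.
r_rescaled = sample_correlation([100 * v for v in x], y)
```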
70
Correlation Coefficients
► Unit free
71
Correlation Coefficients
72
Excel
► Covariance for the sample:
73
Skewness
$\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s}\right)^3$
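The skewness statistic can be sketched as follows, assuming the adjusted formula n/((n-1)(n-2)) times the sum of cubed standardized deviations, with s the sample standard deviation. The function name and the symmetric test data are hypothetical; the right-tailed data reuse the ten values from the median example.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness from cubed standardized deviations."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


symmetric = [1, 2, 3, 4, 5]                            # hypothetical data
right_tailed = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]   # one very large value

skew_symmetric = sample_skewness(symmetric)
skew_right = sample_skewness(right_tailed)
print(abs(skew_symmetric) < 1e-9, skew_right > 0)  # True True
```

A value near zero indicates a symmetric distribution, a positive value indicates a longer right tail, and a negative value a longer left tail.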
74
Kurtosis
► Kurtosis refers to the relative length of the tails and the degree of
concentration in the center.
► A normal bell-shaped population is called mesokurtic and serves
as a benchmark.
► Kurtosis is not the same thing as variability, although the two are
easily confused.
► A histogram is an unreliable guide to kurtosis because its scale
and axis proportions may vary, so a numerical statistic is needed:
75
Kurtosis
76
Exercise
77