
Session 3: Descriptive Statistics

Statistics for Business


Dr. Le Anh Tuan

1
Numerical Description
Three key characteristics of numerical data:
►Center:
►Where are the data values concentrated?
►What seem to be typical or middle data values? Is
there central tendency?
►Variability
►How much dispersion is there in the data?
►How spread out are the data values?
►Are there unusual values?
►Shape
►Are the data values distributed symmetrically?
Skewed? Sharply peaked? Flat? Bimodal?

2
Measures of Central Tendency

3
Measures of Central Tendency
► Central tendency is a single value used to describe the center
point of a data set.
Measures of central tendency: mean (including the weighted mean), median, and mode.

4
Mean
► The mean, or average is the most common measure of central
tendency.

► Calculate the mean by adding all the values in a data set and
then dividing the result by the number of observations.

5
Mean
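► For the population:

$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$

where $x_1, \ldots, x_N$ are the population values and $N$ is the population size.

► For the sample:

$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$

where $x_1, \ldots, x_n$ are the observed values and $n$ is the sample size.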

6
Mean
► Suppose a sample of size n = 10 gives the following values:
4 3 8 10 0 2 3 7 5 2

► The sample mean is

$\bar{x} = \frac{4 + 3 + 8 + 10 + 0 + 2 + 3 + 7 + 5 + 2}{10} = \frac{44}{10} = 4.4$
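The calculation above can be checked with a short Python sketch. It uses only the standard library; the variable names are illustrative and not part of the slides.

```python
from statistics import mean

# The ten sample values from the slide
sample = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]

# Add all the values and divide by the number of observations n
manual_mean = sum(sample) / len(sample)

# statistics.mean performs the same calculation
library_mean = mean(sample)

print(manual_mean, library_mean)
```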

7
Mean
► Advantages:
► Easy to calculate
► Summarizes the data with a single value

► Disadvantages:
► Affected by outliers (values that are much higher or
lower than most of the data)

8
Mean
► Affected by outliers (values that are much higher or lower than
most of the data)

Data 1, 2, 3, 4, 10: $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$, so Mean = 4
Data 1, 2, 3, 4, 5: $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$, so Mean = 3
9
Weighted Mean
► A weighted mean allows you to assign more weight to
certain values and less weight to others.

► Formula for the weighted mean:

$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$

► $w_i$ = the weight for each data value $x_i$

► $\sum_{i=1}^{n} w_i$ = the sum of all the weights.
Weighted Mean
► A GPA is a weighted average that gives greater weight to
courses that earn more credits. Grades range from 0 to 4. Student A
earned 4.0 in Science (4 credits), 3.0 in English (3 credits), and
3.5 in Physics (2 credits). What is A’s GPA?
► Solution:
► Multiply each grade by its weight. Add the products
and then divide by the sum of the weights.

$\bar{x} = \frac{4 \times 4 + 3 \times 3 + 3.5 \times 2}{4 + 3 + 2} = \frac{32}{9} \approx 3.56$
► A’s grade point average is approximately 3.56.
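The GPA calculation can be reproduced with a minimal Python sketch of the weighted mean. The function name `weighted_mean` and the variable names are assumptions made for illustration.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight

grades = [4.0, 3.0, 3.5]    # Science, English, Physics
credit_hours = [4, 3, 2]    # the weight of each course

gpa = weighted_mean(grades, credit_hours)
print(round(gpa, 2))
```

The same function works for any weighted average, for example returns weighted by their likelihoods, as long as the weights are supplied in the same order as the values.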

11
Questions
► You invest in the stock market, where returns depend strongly
on market conditions. The details are as follows:

Stock returns
Market condition   Likelihood   Return
Up                 45%          20%
Neutral            20%          14%
Down               35%          -5%

► Calculate the expected return for this investment.

12
The Median

13
Median
► The median is the value in the data set for which half of the
observations are higher and half of the observations are
lower.

► In other words, the median (M) is the 50th percentile or
midpoint of the ordered sample data.

14
Median
► Not affected by outliers

[Figure: two number lines from 0 to 10; annotations: Mean = 3, Median = 3]
15
Median
► The location of the median:

Median position = $\frac{n + 1}{2}$, the position in the ordered data,
where n is the number of observations.

► If the number of values is odd, the median is the
middle number.
► If the number of values is even, the median is the
average of the two middle numbers.

► Note that (n+1)/2 is not the value of the median, only the
position of the median in the ranked data.

16
Median
► Suppose an ordered sample of size n = 9 gives the following
values:
0 3 4 7 9 12 13 17 25

► The median position is $\frac{9 + 1}{2} = 5$

► The median value is 9

17
Median
► Suppose an ordered sample of size n = 10 gives the following
values:
0 3 4 7 9 12 13 17 25 4000

► The median position is $\frac{10 + 1}{2} = 5.5$

► The median value is $\frac{9 + 12}{2} = 10.5$ (not affected by the outlier 4000)
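The two median examples can be sketched in Python using the (n + 1)/2 position rule. The helper `median_by_position` is a hypothetical name; `statistics.median` from the standard library is shown for comparison.

```python
from statistics import median

def median_by_position(values):
    """Median via the (n + 1) / 2 position in the ordered data."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2               # 1-based position of the median
    lower_index = int(position) - 1      # convert to a 0-based index
    if position == int(position):
        # Odd n: the median is the middle value
        return ordered[lower_index]
    # Even n: average the two middle values
    return (ordered[lower_index] + ordered[lower_index + 1]) / 2

odd_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25]
even_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]

print(median_by_position(odd_sample), median_by_position(even_sample))
print(median(odd_sample), median(even_sample))
```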

18
The Mode

19
Mode

►The mode is the value that appears most often in a data
set.
►If no data value or category occurs more than once,
we say that the mode does not exist.

►More than one mode can exist if two or more values
tie for the most frequent.

►The mode is most useful for discrete or categorical data
with only a few distinct data values. For continuous data or
data with a wide range, the mode is rarely useful.
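The rules above (no mode when nothing repeats, several modes on a tie) can be sketched as follows. The function name `modes` and the sample lists are made up for illustration; they are not data from the slides.

```python
from collections import Counter

def modes(values):
    """Return all modes; an empty list means the mode does not exist."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        # No value occurs more than once, so there is no mode
        return []
    # Every value tied for the highest frequency is a mode
    return [value for value, count in counts.items() if count == highest]

print(modes([1, 2, 3, 4, 5, 6]))    # no repeated value
print(modes([5, 9, 9, 12]))         # a single mode
print(modes([1, 1, 2, 2, 3]))       # two values tie for most frequent
```

Note that Python's own `statistics.multimode` follows a different convention and returns every value when none repeats.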

20
Mode

[Figure: two dot plots. First (axis 0 to 6): No Mode. Second (axis 0 to 14): Mode = 9]

21
Questions

► Which measure of location is the “best”: mean, median, or mode?

22
Shape of a Distribution
►Describes how data are distributed
►Measures of shape
►Symmetric
►Skewed (Left or Right)
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean

23
Measures of Variability

24
Variation

Measures of variation: range, variance, and standard deviation.

►Measures of variation give information on the spread
or variability of the data values.
[Figure: frequency distributions with the same center but different variation]
25
Range

►Simplest measure of variation

►Difference between the largest and the smallest
observations.
Range = Highest Value – Lowest Value

[Figure: dot plot on an axis from 0 to 14]

Range = 14 - 1 = 13

26
Range
►Advantages
►Easy to calculate and understand
►Disadvantages
►Based on only two numbers in the data set; ignores
the way in which the data are distributed

►Sensitive to outliers

27
Variance

►Sample variance is denoted by $s^2$:

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

►Where $\bar{x}$ = sample mean

►n = sample size

►$x_i$ = ith value of the variable X

28
Variance

►A sample includes 8 observations:

10 12 14 15 17 18 18 24
n = 8, Mean $\bar{x} = 16$

►Variance:

$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} \approx 18.57$
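As a check on the arithmetic, the sample variance of these eight values can be computed step by step in Python. This is a sketch with illustrative variable names; `statistics.variance` uses the same n - 1 denominator.

```python
from statistics import mean, variance

sample = [10, 12, 14, 15, 17, 18, 18, 24]

x_bar = mean(sample)                                   # sample mean
squared_deviations = [(x - x_bar) ** 2 for x in sample]
sum_of_squares = sum(squared_deviations)               # numerator of s^2

# Divide by n - 1 for the sample variance
s2_manual = sum_of_squares / (len(sample) - 1)
s2_library = variance(sample)

print(x_bar, sum_of_squares, round(s2_manual, 2), round(s2_library, 2))
```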

29
Variance

►Population variance is denoted by $\sigma^2$:

$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$

►Where $\mu$ = population mean

►N = population size

►$x_i$ = ith value of the variable X

30
Standard Deviations
►Standard deviation is the square root of variance

►Has the same units as original data

►Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$

31
Standard deviation

►A sample includes 8 observations:

10 12 14 15 17 18 18 24
n = 8, Mean $\bar{x} = 16$

►Variance:

$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} \approx 18.57$

Standard deviation: $s = \sqrt{18.57} \approx 4.31$
A measure of how far on average each data value is
from the mean of the sample.
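A brief sketch showing that the sample standard deviation is the square root of the sample variance, using the same eight values. The variable names are illustrative.

```python
from math import sqrt
from statistics import stdev, variance

sample = [10, 12, 14, 15, 17, 18, 18, 24]

# Square root of the sample variance (n - 1 denominator)
s_from_variance = sqrt(variance(sample))

# statistics.stdev computes the sample standard deviation directly
s_library = stdev(sample)

print(round(s_from_variance, 2), round(s_library, 2))
```

For a full population rather than a sample, `statistics.pstdev` divides by N instead of n - 1.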
32
Comparing SD

33
Excel and STATA
► Excel Formulas
► Mean: =AVERAGE(Data Values)
► Weighted Mean: =SUMPRODUCT(Data Values, Weights)/SUM(Weights)
► Median: =MEDIAN(Data Values)
► Mode: =MODE(Data Values)
► Variance: =VAR.S(Data Values)
► SD: =STDEV.S(Data Values)

34
Excel
► Excel Tools: Data/Data Analysis/Descriptive Statistics

35
Excel

36
STATA
► Use the command “summarize” (abbreviated “sum”)

37
Using the Mean and Standard Deviation Together

38
Coefficient of Variation

►The standard deviation is affected by the scale of the data
►When sample means are very different, comparing
SD can be misleading.

►The coefficient of variation, CV, measures SD as a
percentage of the mean.
►A high CV indicates high variability relative to the
size of the mean.
►A low CV indicates low variability relative to the size
of the mean.

►A smaller coefficient of variation indicates more
consistency within a set of data values.

39
Coefficient of Variation

►Sample coefficient of variation

$CV = \frac{s}{\bar{x}} (100)$

s = the sample standard deviation
$\bar{x}$ = the sample mean

►Population coefficient of variation

$CV = \frac{\sigma}{\mu} (100)$

$\sigma$ = the population standard deviation
$\mu$ = the population mean

40
Coefficient of Variation example
►Stock market

        Price for Stock A       Price for Stock B
Mean    100                     60
SD      20                      15
CV      = 20/100 × 100 = 20%    = 15/60 × 100 = 25%

►Although Stock A has a larger standard deviation, its price is more
consistent relative to its mean (lower CV).
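The stock comparison can be reproduced with a small Python sketch. The function name `coefficient_of_variation` is an assumption for illustration.

```python
def coefficient_of_variation(sd, mean):
    """Return the standard deviation as a percentage of the mean."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / mean * 100

cv_a = coefficient_of_variation(sd=20, mean=100)   # Stock A
cv_b = coefficient_of_variation(sd=15, mean=60)    # Stock B

print(round(cv_a, 1), round(cv_b, 1))
# The stock with the smaller CV is the more consistent one
print("A" if cv_a < cv_b else "B")
```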

41
Z-score

►A Z-score identifies the number of standard deviations a
particular value is from the mean of its distribution.

►A Z-score has no unit.

►A Z-score is
►Zero for values equal to the mean
►Positive for values above the mean
►Negative for values below the mean

42
Z-score formula
►Sample Z-score

$z = \frac{x - \bar{x}}{s}$
x = the data value of interest
s = the sample standard deviation
$\bar{x}$ = the sample mean
►Population Z-score

$z = \frac{x - \mu}{\sigma}$
$\sigma$ = the population standard deviation
$\mu$ = the population mean

43
Unusual observations

►Based on its standardized Z-score, a data value is classified as:
►Unusual if |Z| > 2 (beyond μ ± 2σ)
►Outlier if |Z| > 3 (beyond μ ± 3σ)

44
Z-score example
►Price for a glass of milk tea (size L) in Vietnam

15K 28K 43K 50K 70K 90K

►Average price: Mean = 49.33
►SD = 27.41
►How far is the price of KOI THÉ (90K) from the sample
mean of 49.33 (in SD increments)?

45
Z-score example
►Price for a glass of milk tea (size L) in Vietnam

15K 28K 43K 50K 70K 90K

►$z_{KOI} = \frac{90 - 49.33}{27.41} = 1.48$
►The price of KOI is more than one standard deviation (1.48)
above the sample mean.
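The milk tea example can be recomputed from the raw prices with a short Python sketch. The helper `z_score` is a hypothetical name; it uses the sample standard deviation (n - 1 denominator).

```python
from statistics import mean, stdev

prices = [15, 28, 43, 50, 70, 90]   # prices in thousands (K)

def z_score(x, data):
    """Number of sample standard deviations x lies from the sample mean."""
    return (x - mean(data)) / stdev(data)

z_koi = z_score(90, prices)
print(round(mean(prices), 2), round(z_koi, 2))
```

A value with |z| above 2 would be flagged as unusual and above 3 as an outlier, so the 90K price is neither.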

46
Empirical rule

►If the data distribution is a bell-shaped, symmetrical curve
centered around the mean, we would expect:
►About 68% of the values in the population to fall within ±
1 standard deviation from the mean (μ ± 1σ)
►About 95% of the values in the population to fall within ±
2 standard deviations from the mean (μ ± 2σ)
►About 99.7% of the values in the population to fall within ±
3 standard deviations from the mean (μ ± 3σ)
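The 68%, 95%, and 99.7% figures come from the normal distribution. As a sketch, assuming an exactly normal population, the share within k standard deviations equals erf(k / √2), which the Python standard library can evaluate. The function name is illustrative.

```python
from math import erf, sqrt

def normal_share_within(k):
    """Share of a normal population within k standard deviations of the mean."""
    return erf(k / sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_share_within(k) * 100, 2))
```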
47
Chebychev’s Theorem
► For any population with mean μ and standard deviation σ, and k >
1, the percentage of observations that fall within the interval
[μ ± kσ]

► is at least

$\left(1 - \frac{1}{k^2}\right) \times 100$

► Regardless of how the data are distributed, at least (1 - 1/k²) of the
values will fall within k standard deviations of the mean (for k > 1)
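A minimal sketch of the Chebyshev bound for a few values of k. The function name `chebyshev_min_percent` is an assumption for illustration.

```python
def chebyshev_min_percent(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return (1 - 1 / k ** 2) * 100

for k in (1.5, 2, 3):
    print(k, round(chebyshev_min_percent(k), 2))
```

Because the bound holds for any distribution, it is weaker than the empirical rule: at k = 2 it guarantees only 75%, compared with about 95% for bell-shaped data.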

48
Grouped Data

49
Grouped Data

► Suppose data are grouped into K classes, with frequencies f1, f2,
. . ., fK, and the midpoints of the classes are m1, m2, . . ., mK.
► For a sample of n observations, the mean is

$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$

n: the total number of observations
K: the number of classes

► This mean is only an approximate value since the midpoint is just
an estimate of the value in each class.

50
Grouped Data

► The following table gives the frequency distribution of the
number of orders received each morning during the past 30
mornings at a coffee store. Calculate the mean.

Number of orders   Frequency (f)
0 - 4              4
5 - 9              5
10 - 14            10
15 - 19            7
20 - 24            4
                   n = 30

51
Grouped Data
► Calculate the mean.
► m is the midpoint of the class: add the class limits and
divide by 2.

Number of orders   Frequency (f)   Mid-point (m)   fm
0 - 4              4               2               8
5 - 9              5               7               35
10 - 14            10              12              120
15 - 19            7               17              119
20 - 24            4               22              88
                   n = 30                          370

► Mean $\bar{x} = \frac{\sum f m}{n} = \frac{370}{30} = 12.33$
► The average number of orders is about 12.33
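The grouped-mean calculation in the table can be sketched in Python. The tuple layout (lower limit, upper limit, frequency) and the function name are choices made for this illustration.

```python
# Each class: (lower limit, upper limit, frequency)
classes = [(0, 4, 4), (5, 9, 5), (10, 14, 10), (15, 19, 7), (20, 24, 4)]

def grouped_mean(classes):
    """Approximate the mean using class midpoints weighted by frequency."""
    n = sum(freq for _, _, freq in classes)
    # The midpoint is the average of the two class limits
    total = sum(freq * (low + high) / 2 for low, high, freq in classes)
    return total / n

print(round(grouped_mean(classes), 2))
```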
52
Grouped Data

► For a sample of n observations, the variance is

$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$

53
Grouped Data
► Calculate the variance.

Number of orders   Frequency (f)   Mid-point (m)   m − x̄     (m − x̄)²   f × (m − x̄)²
0 - 4              4               2               -10.33    106.71     426.84
5 - 9              5               7               -5.33     28.41      142.04
10 - 14            10              12              -0.33     0.11       1.09
15 - 19            7               17              4.67      21.81      152.66
20 - 24            4               22              9.67      93.51      374.04
                   n = 30                                               1096.67

► Variance $s^2 = \frac{1096.67}{30 - 1} = 37.82$
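The grouped-variance table can be reproduced with a short Python sketch. As before, the tuple layout and function name are assumptions made for illustration.

```python
# Each class: (lower limit, upper limit, frequency)
classes = [(0, 4, 4), (5, 9, 5), (10, 14, 10), (15, 19, 7), (20, 24, 4)]

def grouped_variance(classes):
    """Approximate the sample variance from class midpoints and frequencies."""
    n = sum(freq for _, _, freq in classes)
    midpoints = [(low + high) / 2 for low, high, _ in classes]
    freqs = [freq for _, _, freq in classes]
    x_bar = sum(f * m for f, m in zip(freqs, midpoints)) / n
    # Sum of f * (m - x_bar)^2 over all classes
    weighted_squares = sum(f * (m - x_bar) ** 2 for f, m in zip(freqs, midpoints))
    return weighted_squares / (n - 1)

print(round(grouped_variance(classes), 2))
```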

54
Relative Position
Percentiles, Quartiles, and Box Plots

55
Measures of relative position

► Measures of relative position compare the position of one
value in relation to other values in the data set.

► Measures:
► Percentiles
► Quartiles
► Interquartile range

56
Percentiles

► Percentiles divide the ordered data into 100
groups.

► For example, suppose your EQ score is at the 80th percentile on a
standardized test. That means that 80% of the test-takers
scored below you.

► Generally, the pth percentile of a data set (where p is any
number between 1 and 100) is the value such that at least p
percent of the observations fall at or below it.

57
Percentiles

58
Percentiles

► To find percentiles manually:
► Sort the data from lowest to highest
► Calculate the position, i:
$i = \frac{p}{100} n$, where p = the percentile of interest and
n = the number of data values.

► If i is not a whole number, round i up to the next whole
number; the value in the ith position is our value of interest.

► If i is a whole number, the midpoint between the values in the ith and
(i+1)th positions is our value of interest.
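The manual procedure above can be sketched in Python. The function name `percentile` is illustrative, p is assumed to be a whole number from 1 to 100, and the handling of p = 100 (returning the largest value, since no (i+1)th value exists) is an assumption not stated in the slides.

```python
def percentile(data, p):
    """p-th percentile (p a whole number from 1 to 100) by the manual rule."""
    if not data:
        raise ValueError("data must not be empty")
    if not 1 <= p <= 100:
        raise ValueError("p must be between 1 and 100")
    ordered = sorted(data)
    n = len(ordered)
    # i = (p / 100) * n, split into whole part and remainder to avoid rounding error
    whole, remainder = divmod(p * n, 100)
    if remainder != 0:
        # i is not whole: round up to position whole + 1, which is index `whole`
        return ordered[whole]
    if whole == n:
        # Assumption: for p = 100 return the largest value
        return ordered[-1]
    # i is whole: midpoint of the i-th and (i+1)-th values
    return (ordered[whole - 1] + ordered[whole]) / 2

data = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]
print(percentile(data, 25), percentile(data, 50), percentile(data, 75))
```

Spreadsheet and statistics packages often interpolate between positions instead, so their percentiles can differ slightly from this rule.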

59
Quartiles
► Quartiles are scale points that divide the sorted data into four
groups of approximately equal size.

► The first quartile, Q1, is the value for which 25% of the observations are
smaller and 75% are larger.
► Q2 is the same as the median (50% are smaller, 50% are larger)
► Only 25% of the observations are greater than the third quartile.
Questions?
► What is the median of the data values below Q2?
► What is the median of the data values above Q2?
Interquartile Range

► The interquartile range describes the middle 50% of the data.

► It excludes the highest- and lowest-valued observations, so it is not affected by outliers

► Interquartile range = 3rd quartile – 1st quartile

IQR = Q3 – Q1
Box-and-Whisker plot
► A box plot (also called a box-and-whisker plot) is a graphical
display showing the relative position of the five-number
summary:
Min, Q1, Q2, Q3, Max
► It also shows outliers, if any

[Figure: box-and-whisker plot with outliers marked by asterisks (**)]

Outliers
►Formulas for the upper and lower limits of outliers

►Upper Limit = Q3 + 1.5 × IQR

►Lower Limit = Q1 - 1.5 × IQR

►Values beyond these limits are considered outliers
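The outlier limits can be sketched in Python as follows. The quartiles are taken as the 25th and 75th percentiles using the manual position rule i = (p/100)n; the function names are illustrative, and other quartile conventions would give slightly different limits.

```python
def quartile(data, p):
    """25th or 75th percentile using the manual position rule i = (p/100) * n."""
    ordered = sorted(data)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder != 0:
        return ordered[whole]                      # round i up to the next position
    return (ordered[whole - 1] + ordered[whole]) / 2

def outlier_limits(data):
    """Return (lower limit, upper limit) based on 1.5 x IQR."""
    q1, q3 = quartile(data, 25), quartile(data, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]
lower, upper = outlier_limits(data)
outliers = [x for x in data if x < lower or x > upper]
print(lower, upper, outliers)
```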

64
Examples

65
STATA

[Figure: Stata plot of ROA, axis from -0.4 to 0.4]

66
Association Between Two Variables
(Covariance , Correlation)

67
Covariance
► The covariance measures the direction of the linear relationship between two
variables.
► The population covariance:

$\sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$

► The sample covariance:

$Cov(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$

► Only concerned with the direction of the relationship (positive, negative, no
relationship)
► No causal effect is implied

68
Covariance

► Covariance between two variables:

► Cov(x,y) > 0: x and y tend to move in the same direction

► Cov(x,y) < 0: x and y tend to move in opposite directions

► Cov(x,y) = 0: x and y have no linear relationship (this alone does not imply independence)
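A minimal sketch of the sample covariance and the sign interpretation above. The data are made up for illustration and the function name `sample_covariance` is an assumption.

```python
from statistics import mean

def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar) divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar, y_bar = mean(x), mean(y)
    products = [(xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)]
    return sum(products) / (len(x) - 1)

# Hypothetical data, not taken from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

cov_same = sample_covariance(x, y)                       # y tends to rise with x
cov_opposite = sample_covariance(x, [-v for v in y])     # y tends to fall as x rises
print(cov_same, cov_opposite)
```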

69
Correlation Coefficient
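► The sample correlation coefficient, $r_{xy}$, measures both the strength and
direction of the linear relationship between two variables.

► Formula for population correlation coefficient:

$\rho = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$

where $\sigma_{xy}$ is the population covariance and $\sigma_x$, $\sigma_y$ are the population standard deviations.

► Formula for sample correlation coefficient:

$r = \frac{Cov(x, y)}{s_x s_y}$

where $s_x$ and $s_y$ are the sample standard deviations of the x and y variables.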

70
Correlation Coefficients
► Unit free

► Ranges between –1 and 1

► The closer to –1, the stronger the negative linear relationship

► The closer to 1, the stronger the positive linear relationship

► The closer to 0, the weaker any linear relationship
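These properties can be illustrated with a short Python sketch that divides the sample covariance by the two sample standard deviations. The data are hypothetical and the function name is an assumption.

```python
from math import sqrt
from statistics import mean, stdev

def sample_correlation(x, y):
    """r = Cov(x, y) / (s_x * s_y), using n - 1 denominators throughout."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    x_bar, y_bar = mean(x), mean(y)
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (len(x) - 1)
    return cov / (stdev(x) * stdev(y))

# Hypothetical data, not taken from the slides
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = sample_correlation(x, y)
r_perfect = sample_correlation(x, [2 * v + 1 for v in x])   # exact linear relation
print(round(r, 3), round(r_perfect, 3))
```

Rescaling either variable (for example, changing units) changes the covariance but leaves r unchanged, which is why the correlation coefficient is unit free.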

71
Correlation Coefficients

72
Excel
► Covariance for the sample:

=COVARIANCE.S(X DATA VALUES, Y DATA VALUES)

► Correlation for the sample:

=CORREL(X DATA VALUES, Y DATA VALUES)

Excel Tools: Data/Data Analysis/Covariance
Data/Data Analysis/Correlation

73
Skewness

$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$
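The skewness statistic can be sketched in Python, assuming the adjusted sample formula n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations. The function name and the two small data sets are made up for illustration.

```python
from statistics import mean, stdev

def sample_skewness(data):
    """n / ((n-1)(n-2)) * sum(((x_i - x_bar) / s) ** 3)."""
    n = len(data)
    if n < 3:
        raise ValueError("at least 3 observations are required")
    x_bar = mean(data)
    s = stdev(data)                      # sample standard deviation
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    cubed = sum(((x - x_bar) / s) ** 3 for x in data)
    return n / ((n - 1) * (n - 2)) * cubed

symmetric = [1, 2, 3, 4, 5]          # hypothetical symmetric data
right_skewed = [1, 2, 3, 4, 20]      # hypothetical data with a long right tail

print(round(sample_skewness(symmetric), 3), round(sample_skewness(right_skewed), 3))
```

A value near zero suggests symmetry, a positive value a right-skewed distribution, and a negative value a left-skewed one.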

74
Kurtosis
► Kurtosis refers to the relative length of the tails and the degree of
concentration in the center.
► A normal bell-shaped population is called mesokurtic and serves
as a benchmark.

► A population that is flatter than a normal population (i.e., has
thinner tails) is called platykurtic, while one that is more sharply
peaked than a normal population (i.e., has heavier tails) is
leptokurtic.

► Kurtosis is not the same thing as variability, although the two are
easily confused.
► A histogram is an unreliable guide to kurtosis because its scale
and axis proportions may vary, so a numerical statistic is needed:

75
Kurtosis

76
Exercise

► Review Session 3, Online Quiz 3.

► Homework, Group Assignment

► Reading: Chapter 4, Probability

77
