
Statistics

Lecture 6
Descriptive Statistics

1. ANALYSIS OF LOCATION (CENTRAL TENDENCY)


where along the scale of all possible values the distribution is centred (mean, median, mode).
Purpose: to describe the population with a single figure, a representative value of a mass of data.

2. ANALYSIS OF DISPERSION (VARIATION)


how the data vary; it should always complement the measures of location, because a mean reported alone (for example) can be very misleading.

3. ANALYSIS OF SKEWNESS (SYMMETRY)


whether the distribution is symmetric or skewed

4. ANALYSIS OF CONCENTRATION

ANALYSIS OF CONCENTRATION: how the total value of the variable is distributed among the elementary units, i.e. whether it is spread uniformly among them or not.
ANALYSIS OF KURTOSIS (PEAKEDNESS): how strongly the elementary units are concentrated near the mean value, i.e. whether the distribution is mesokurtic, leptokurtic or platykurtic.
Descriptive Statistics – MEASURES OF CONCENTRATION

The analysis of concentration can be presented by calculating the Gini coefficient and drawing the Lorenz curve.
Example: 10 people work in a firm; together they earn 100 euros.

Equal incomes:

no      income   cumulative income
1       10       10
2       10       20
3       10       30
4       10       40
5       10       50
6       10       60
7       10       70
8       10       80
9       10       90
10      10       100
total   100

Unequal incomes:

no      income   cumulative income
1       1        1
2       2        3
3       4        7
4       6        13
5       8        21
6       10       31
7       12       43
8       14       57
9       16       73
10      27       100
total   100

[Figures: cumulative income (0-100) plotted against employee number (1-10) for each of the two distributions]

Highly concentrated incomes:

no      income   cumulative income
1       0        0
2       0        0
3       0        0
4       0        0
5       0        0
6       0        0
7       0        0
8       0        0
9       5        5
10      95       100
total   100

[Figure: the Lorenz curve, cumulative income (0-100) plotted against employee number (1-10)]

A - the area between the diagonal and the curve of the real income distribution
B - the area between the curve and the axis

$G = \frac{A}{A + B}$

G = 0 - complete equality
G closer to 0 - income is distributed more evenly
G closer to 1 - income is distributed more unevenly
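The ratio G = A / (A + B) can be checked numerically for the three ten-person firms above. The sketch below is one possible discretisation, not necessarily the one used to draw the slides: it assumes the Lorenz curve is a piecewise-linear line through the cumulative income shares and uses the trapezoid rule for area B. The function names are illustrative.

```python
def lorenz_points(incomes):
    """Cumulative income shares (0..1) for incomes sorted ascending, starting at 0."""
    ordered = sorted(incomes)
    total = sum(ordered)
    shares = [0.0]
    running = 0
    for value in ordered:
        running += value
        shares.append(running / total)
    return shares


def gini(incomes):
    """Gini coefficient G = A / (A + B), with area B found by the trapezoid rule."""
    shares = lorenz_points(incomes)
    n = len(incomes)
    # Area B under the Lorenz curve; each person spans a width of 1/n.
    area_b = sum((shares[i] + shares[i + 1]) / 2 for i in range(n)) / n
    # The triangle under the diagonal has area A + B = 1/2.
    return (0.5 - area_b) / 0.5


equal = [10] * 10
unequal = [1, 2, 4, 6, 8, 10, 12, 14, 16, 27]
concentrated = [0] * 8 + [5, 95]

for name, data in [("equal", equal), ("unequal", unequal), ("concentrated", concentrated)]:
    print(name, round(gini(data), 3))
```

Under these assumptions the equal firm gives G of essentially 0, the unequal firm roughly 0.40 and the highly concentrated firm roughly 0.89. With only 10 people the coefficient cannot reach 1 exactly, because even a single earner leaves a small area under the curve.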
The Gini coefficient

• compares the Lorenz curve of a ranked empirical distribution with the line of perfect equality (the line that assumes each observation contributes the same amount to the total of all observations),
• ranges between 0 and 1: 0 - no concentration (perfect equality), 1 - total concentration (perfect inequality),
• measures the degree of concentration (inequality) of a variable across the elements of a distribution, i.e. the deviation from perfect equality: the further the Lorenz curve lies from the straight line of perfect equality (Gini coefficient = 0), the higher the Gini coefficient and the less equal the society,
• is used to assess income distribution among a set of regions (or countries) and other spatial phenomena such as industrial location.

https://www.economicsonline.co.uk/Definitions/Gini_co-efficient.html https://towardsdatascience.com/clearly-explained-gini-coefficient-and-lorenz-curve-fe6f5dcdc07

[Figure: world map of income-inequality Gini coefficients by country (in %), based on World Bank data (2021)]

The lowest Gini coefficients, around 0.2, indicate a low degree of inequality.
The highest, above 0.6, indicate very unequal income distributions.
[Figures: maps of the Gini coefficient for Europe and for Vietnam (panels: by provinces, by districts)]
S80/S20 ratio

The "income quintile share ratio" (also called the "S80/S20 ratio" or "20/20 ratio") is
the ratio of the total income received by the 20% of the population with the highest income (the top, i.e. 5th, quintile)
to the total income received by the 20% of the population with the lowest income (the bottom, i.e. 1st, quintile).

Equivalently, it is the annual income of the top 20% of the population expressed as the number of years the bottom 20% of the population would have to work to earn the same amount.

POLAND: in 2015 the share of disposable income of the top 20% of the population was around 5 times greater than that of the bottom 20% of the population.

Disposable income, also known as disposable personal income (DPI), is the amount of money that an individual or household has to spend or save after income taxes have been deducted.
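The definition of the S80/S20 ratio translates directly into a few lines of code. The sketch below applies it to a hypothetical list of ten incomes (the unequal ten-person firm from the Lorenz curve example); it assumes the number of people is a multiple of 5 so that each quintile group holds a whole number of people. The function name is illustrative.

```python
def s80_s20(incomes):
    """Total income of the top 20% divided by total income of the bottom 20%."""
    ordered = sorted(incomes)
    group = len(ordered) // 5  # people per quintile group (assumes len is a multiple of 5)
    if group == 0:
        raise ValueError("need at least 5 incomes")
    bottom = sum(ordered[:group])
    top = sum(ordered[-group:])
    if bottom == 0:
        raise ValueError("bottom quintile has zero income; ratio is undefined")
    return top / bottom


unequal = [1, 2, 4, 6, 8, 10, 12, 14, 16, 27]
print(round(s80_s20(unequal), 2))    # (16 + 27) / (1 + 2)
print(s80_s20([10] * 10))            # equal incomes give a ratio of 1
```

For the unequal firm the top two earners receive 43 euros and the bottom two receive 3 euros, so the ratio is about 14.3: the bottom group would need over 14 periods to earn what the top group earns in one.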
Descriptive Statistics - INFORMATION ON CENTRE AND VARIATION

BOXPLOTS - box-and-whisker diagrams

A five-number summary of a data set: the minimum, the maximum and the 3 quartiles. It provides information on centre and variation, and shows whether a distribution is skewed.

The box covers the interquartile interval (the middle 50% of the data); the median splits the box in two.

X - the mean is sometimes indicated by a dot or a cross on the box plot.

If the whisker to the right of the box is longer than the one to the left, there are more extreme values towards the positive end and the distribution is positively skewed.

The diagram was invented by John Tukey.


MODIFIED BOXPLOTS

An outlier may be the result of:
• a measurement error,
• an observation from a different population,
• an unusual extreme observation;
• it may instead be an indicator of skewness.

Usually we use the quartiles and the interquartile range (IQR = Q3 - Q1) to identify potential outliers.

We define the lower and upper limits (fences):
Lower limit (fence): 1.5 IQRs below the first quartile = Q1 - 1.5 IQR
Upper limit (fence): 1.5 IQRs above the third quartile = Q3 + 1.5 IQR
Observations below the lower limit or above the upper limit are potential outliers.
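The fence rule can be sketched with Python's standard library. The data set below is hypothetical, and the quartile convention is an assumption: `statistics.quantiles` with `method="inclusive"` interpolates linearly between data points, while other tools may use slightly different conventions and therefore slightly different fences.

```python
import statistics


def tukey_fences(data):
    """Return (lower, upper) fences: Q1 - 1.5 IQR and Q3 + 1.5 IQR."""
    q1, _median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


# Hypothetical sample with one suspiciously large value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]
lower, upper = tukey_fences(data)
outliers = [x for x in data if x < lower or x > upper]

print("fences:", lower, upper)
print("potential outliers:", outliers)
```

Only the value 50 falls outside the fences here. Whether it is a measurement error, an observation from another population or a genuine extreme still has to be judged from the context.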
Descriptive Statistics - INFORMATION ON CENTRE, VARIATION AND SKEWNESS

MODIFIED BOXPLOTS

Measurement            Distribution A   Distribution B   Distribution C
Minimum                0.00             0.11             0.14
Lower quartile (Q1)    0.02             0.37             0.69
Median (Q2)            0.11             0.48             0.88
Upper quartile (Q3)    0.32             0.58             0.95
Maximum                0.86             0.93             1.00

[Figure: modified boxplots of distributions A, B and C]

INFORMATION ON CENTRE
The centre of distribution A is the lowest of the three distributions (median is 0.11).
The centre of distribution C is the highest of the three distributions (median is 0.88).

INFORMATION ON VARIATION
A - the interquartile range is Q3 - Q1 = 0.32 - 0.02 = 0.30
B - the interquartile range is Q3 - Q1 = 0.58 - 0.37 = 0.21
C - the interquartile range is Q3 - Q1 = 0.95 - 0.69 = 0.26
B is the most concentrated distribution, because its interquartile range is 0.21, compared to 0.30 for distribution A and 0.26 for distribution C.
Distributions A, B and C all include potential outliers.

INFORMATION ON SKEWNESS
A - the distribution is positively skewed: the whisker and half-box are longer on the right side of the median than on the left side.
B - the distribution is approximately symmetric: both half-boxes are almost the same length (0.11 on the left side and 0.10 on the right side).
C - the distribution is negatively skewed: the whisker and half-box are longer on the left side of the median than on the right side.

Source: https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm
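The statement that A, B and C all include potential outliers can be checked by applying the 1.5 IQR fences to the five-number summaries in the table. This is a minimal sketch: since only the summaries are available, it can only tell whether the minimum or the maximum lies beyond a fence, not how many outliers there are. The variable names are illustrative.

```python
# Five-number summaries from the table: (min, Q1, median, Q3, max).
summaries = {
    "A": (0.00, 0.02, 0.11, 0.32, 0.86),
    "B": (0.11, 0.37, 0.48, 0.58, 0.93),
    "C": (0.14, 0.69, 0.88, 0.95, 1.00),
}

results = {}
for name, (minimum, q1, median, q3, maximum) in summaries.items():
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    # (outlier below the lower fence?, outlier above the upper fence?)
    results[name] = (minimum < lower_fence, maximum > upper_fence)
    print(name, "IQR =", round(iqr, 2),
          "fences =", round(lower_fence, 3), round(upper_fence, 3),
          "low/high outliers:", results[name])
```

A and B have a maximum above the upper fence, while C has a minimum below the lower fence, which agrees with the direction of skewness read from the boxplots for A and C.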
Descriptive Statistics - BOXPLOTS - in Excel

Insert > Insert Statistic Chart > Box and Whisker

https://www150.statcan.gc.ca/n1/edu/power-pouvoir/ch12/5214889-eng.htm
Descriptive Statistics – DESCRIPTIVE ANALYSIS in Excel

Data > Analysis | Data Analysis, then choose the Descriptive Statistics option in the dialog box.

Statistic              Description
Mean                   the arithmetic mean of the sample data.
Standard Error         the standard error of the mean (the sample standard deviation divided by the square root of the count).
Median                 the middle value in the data set (the value that separates the largest half of the values from the smallest half).
Mode                   the most common value in the data set.
Standard Deviation     the sample standard deviation of the data set.
Sample Variance        the sample variance of the data set (the squared standard deviation).
Kurtosis               the kurtosis of the distribution.
Skewness               the skewness of the data set's distribution.
Range                  the difference between the maximum and minimum values in the data set.
Minimum                the smallest value in the data set.
Maximum                the largest value in the data set.
Sum                    the sum of all the values in the data set.
Count                  the number of values in the data set.
Largest(X)             the X-th largest value in the data set.
Smallest(X)            the X-th smallest value in the data set.
Confidence Level(X)    the half-width of the confidence interval for the mean at the X percent confidence level.

Kurtosis (Excel reports excess kurtosis, so the normal distribution scores 0):
= 0 - mesokurtic (normal) distribution
> 0 - leptokurtic distribution, more peaked than the normal one
< 0 - platykurtic distribution, flatter than the normal one

Skewness:
= 0 - the distribution is symmetric
< 0 - the distribution is skewed to the left (negatively skewed)
> 0 - the distribution is skewed to the right (positively skewed)
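Most rows of this summary table can be reproduced outside Excel. The sketch below uses Python's standard `statistics` module on a small hypothetical data set; skewness and kurtosis are left out because the standard library has no function for them. The dictionary keys simply mirror the labels in the table.

```python
import statistics
from math import sqrt

# Hypothetical sample; any list of numbers works.
data = [2, 4, 4, 4, 5, 5, 7, 9]

sample_sd = statistics.stdev(data)  # sample standard deviation (divides by n - 1)

summary = {
    "Mean": statistics.mean(data),
    "Standard Error": sample_sd / sqrt(len(data)),  # standard error of the mean
    "Median": statistics.median(data),
    "Mode": statistics.mode(data),
    "Standard Deviation": sample_sd,
    "Sample Variance": statistics.variance(data),
    "Range": max(data) - min(data),
    "Minimum": min(data),
    "Maximum": max(data),
    "Sum": sum(data),
    "Count": len(data),
}

for label, value in summary.items():
    print(f"{label:20s} {value}")
```

Note that `stdev` and `variance` are the sample versions (denominator n - 1), matching the "Standard Deviation" and "Sample Variance" rows, whereas the moment formulas later in the lecture divide by N.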
Descriptive Statistics - MOMENTS OF THE DISTRIBUTION

In statistics we often talk about the moments of a distribution.

The term "moment" has its origin in mechanics, where the term "moment of a force" is used.
Statistical moments are analogous to these physical moments, except that in statistics we talk about moments per unit of frequency within the distribution.

Definition: moments of a distribution
The kth moment of a distribution is the average of the kth powers of the deviations of the individual observations from a chosen value x0.

In general the kth moment is defined as follows:

$m_k = \frac{\sum_i (x_i - x_0)^k}{N}$

In the case of grouped data the formula becomes:

$m_k = \frac{\sum_i (x_i - x_0)^k n_i}{N}$ for point values $x_i$ with frequencies $n_i$, for example:

Grades x_i   Number of grades n_i
2            0
3            3
4            10
5            3
Total        16

$m_k = \frac{\sum_i (x'_i - x_0)^k n_i}{N}$ for class intervals, where $x'_i$ is the midpoint of the ith interval, for example:

Wages     Number of employees n_i
0 - 6     3
6 - 12    4
12 - 18   13
Total     20
We distinguish:
- the moments about the origin, $x_0 = 0$,
- the moments about the mean, $x_0 = \bar{x}$.

The moments about the origin ($x_0 = 0$)

Setting $x_0 = 0$ in $m_k = \frac{\sum_i (x_i - x_0)^k}{N}$ gives

$m_k = \frac{\sum_i x_i^k}{N}$

In the case of grouped data the formula becomes:

$m_k = \frac{\sum_i x_i^k n_i}{N}$ for point values $x_i$ with frequencies $n_i$ (e.g. grades and the number of each grade),

$m_k = \frac{\sum_i (x'_i)^k n_i}{N}$ for class intervals with midpoints $x'_i$ (e.g. wage brackets and the number of employees in each).

It should be noted that the first moment about the origin is simply equal to the mean: $m_1 = \bar{x}$.
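The grouped-data formula can be tried on the two small tables used in this lecture (grades 2-5 with counts 0, 3, 10, 3, and wage intervals 0-6, 6-12, 12-18 with counts 3, 4, 13). The sketch below is illustrative: the function name and argument order are assumptions, and the wage intervals are represented by their midpoints 3, 9 and 15.

```python
def moment(values, counts, k, x0=0.0):
    """kth moment about x0 for grouped data: sum((x - x0)^k * n) / N."""
    total = sum(counts)
    return sum((x - x0) ** k * n for x, n in zip(values, counts)) / total


grades = [2, 3, 4, 5]
grade_counts = [0, 3, 10, 3]

wage_midpoints = [3, 9, 15]   # midpoints of 0-6, 6-12, 12-18
wage_counts = [3, 4, 13]

m1_grades = moment(grades, grade_counts, 1)          # first moment about the origin = mean
m2_grades = moment(grades, grade_counts, 2)
m1_wages = moment(wage_midpoints, wage_counts, 1)

print("mean grade:", m1_grades)
print("second moment of grades about the origin:", m2_grades)
print("mean wage:", m1_wages)
```

The first moments come out as a mean grade of 4 and a mean wage of 12, confirming that the first moment about the origin is the arithmetic mean. Passing `x0=m1_grades` to the same function gives the moments about the mean.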
The moments about the mean ($x_0 = \bar{x}$)

The 1st central moment

individual data: $\mu_1 = \frac{\sum_i (x_i - \bar{x})}{N} = 0$
grouped data (point values): $\mu_1 = \frac{\sum_i (x_i - \bar{x}) n_i}{N} = 0$
grouped data (class intervals, $x_i \to x'_i$): $\mu_1 = \frac{\sum_i (x'_i - \bar{x}) n_i}{N} = 0$

The first central moment is always zero, since the sum of the deviations of the observations from their mean is zero.

The 2nd central moment - the variance of X

individual data: $\mu_2 = \frac{\sum_i (x_i - \bar{x})^2}{N} = S_x^2$
grouped data (point values): $\mu_2 = \frac{\sum_i (x_i - \bar{x})^2 n_i}{N} = S_x^2$
grouped data (class intervals): $\mu_2 = \frac{\sum_i (x'_i - \bar{x})^2 n_i}{N} = S_x^2$

The 3rd central moment

For odd values of k, some terms in the sum are positive and some are negative. For symmetric distributions the positive and negative terms cancel out.

individual data: $\mu_3 = \frac{\sum_i (x_i - \bar{x})^3}{N}$
grouped data (point values): $\mu_3 = \frac{\sum_i (x_i - \bar{x})^3 n_i}{N}$
grouped data (class intervals): $\mu_3 = \frac{\sum_i (x'_i - \bar{x})^3 n_i}{N}$

Standardised measure of skewness: $\alpha_3 = \frac{\mu_3}{S_x^3}$; in practice $\alpha_3$ usually lies in the interval (-2; 2).

$\alpha_3 = 0$ - the distribution is symmetric
$\alpha_3 < 0$ - the distribution is skewed to the left (negatively skewed)
$\alpha_3 > 0$ - the distribution is skewed to the right (positively skewed)
The 4th central moment

individual data: $\mu_4 = \frac{\sum_i (x_i - \bar{x})^4}{N}$
grouped data (point values): $\mu_4 = \frac{\sum_i (x_i - \bar{x})^4 n_i}{N}$
grouped data (class intervals): $\mu_4 = \frac{\sum_i (x'_i - \bar{x})^4 n_i}{N}$

Standardised measure of kurtosis: $\alpha_4 = \frac{\mu_4}{S_x^4}$. For the normal distribution $\alpha_4 = 3$.

MESOKURTIC (normal) DISTRIBUTION ($\alpha_4 = 3$) - a distribution with the same concentration about the mean value as the normal distribution.

LEPTOKURTIC DISTRIBUTION ($\alpha_4 > 3$) - a distribution with a higher (greater) concentration about the mean value than the normal distribution. It is more peaked than the normal one.

PLATYKURTIC DISTRIBUTION ($\alpha_4 < 3$) - a distribution with a lower (smaller) concentration about the mean value than the normal distribution. It is flatter than the normal one.
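The central moments and the two standardised measures can be illustrated on a small grouped data set. The sketch below uses the grades example from earlier in the lecture (grades 2, 3, 4, 5 with counts 0, 3, 10, 3); the function names are illustrative, and the comparison of the kurtosis measure with 3 relies on the fact that the normal distribution has a standardised fourth moment of 3.

```python
def central_moment(values, counts, k):
    """kth moment about the mean for grouped data (denominator N)."""
    total = sum(counts)
    mean = sum(x * n for x, n in zip(values, counts)) / total
    return sum((x - mean) ** k * n for x, n in zip(values, counts)) / total


def standardised(values, counts, k):
    """kth central moment divided by the kth power of the standard deviation."""
    variance = central_moment(values, counts, 2)
    return central_moment(values, counts, k) / variance ** (k / 2)


grades = [2, 3, 4, 5]
counts = [0, 3, 10, 3]

mu1 = central_moment(grades, counts, 1)   # always 0
mu2 = central_moment(grades, counts, 2)   # variance
alpha3 = standardised(grades, counts, 3)  # skewness measure
alpha4 = standardised(grades, counts, 4)  # kurtosis measure

print("mu1 =", mu1, "mu2 =", mu2)
print("alpha3 =", alpha3, "alpha4 =", round(alpha4, 3))
```

The grades are symmetric around the mean of 4, so the skewness measure is 0, and the kurtosis measure is about 2.67, below 3, which makes this small distribution platykurtic.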
Example
The Radio-Taxi company surveyed its taxis with respect to the weekly number of kilometres travelled.
Using the method of statistical moments, present complete descriptive statistics of the population.

Mileage (thousand km)   Percentage of taxis n_i   Midpoint x'_i   x'_i n_i
0.8 - 1.2               2                         1.0             2.0
1.2 - 1.6               15                        1.4             21.0
1.6 - 2.0               41                        1.8             73.8
2.0 - 2.4               33                        2.2             72.6
2.4 - 2.8               6                         2.6             15.6
2.8 - 3.2               3                         3.0             9.0
Total                   N = 100                                   194.0

Measures of central tendency

The mean:

$\bar{x} = \frac{\sum_{i=1}^{k} x'_i n_i}{N} = \frac{194}{100} = 1.94$

Interpretation:
the average weekly mileage of a taxi is 1.94 thousand km.
Example (continued): measures of dispersion

Mileage (thousand km)   n_i       x'_i   x'_i n_i   (x'_i - x̄)^2 n_i
0.8 - 1.2               2         1.0    2.0        1.77
1.2 - 1.6               15        1.4    21.0       4.37
1.6 - 2.0               41        1.8    73.8       0.80
2.0 - 2.4               33        2.2    72.6       2.23
2.4 - 2.8               6         2.6    15.6       2.61
2.8 - 3.2               3         3.0    9.0        3.37
Total                   N = 100          194.0      15.16

THE VARIANCE AND THE STANDARD DEVIATION

$\mu_2 = S^2(x) = \frac{\sum_i (x'_i - \bar{x})^2 n_i}{N} = \frac{15.16}{100} = 0.1516$

$S(x) = \sqrt{0.1516} \approx 0.39$, rounded to 0.4 below.

Interpretation: taxi mileages differ from the mean by about 0.4 thousand km on average.

THE COEFFICIENT OF VARIATION

$V_x = \frac{S_x}{\bar{x}} \cdot 100\% = \frac{0.39}{1.94} \cdot 100\% \approx 20\%$

Interpretation: the standard deviation of taxi mileage amounts to about 20% of the mean.

THE TYPICAL RANGE

$\bar{x} - S_x < x_{typ} < \bar{x} + S_x$, i.e. $1.54 < x_{typ} < 2.34$

Interpretation: a typical taxi mileage ranges from 1.54 to 2.34 thousand km.
Or: about 68% of taxi mileages lie between 1.54 and 2.34 thousand km (this share holds for distributions close to normal).
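The mean, variance, coefficient of variation and typical range for the taxi example can be recomputed from the class midpoints and percentages. This is a sketch of the same arithmetic, with illustrative variable names; it keeps the unrounded standard deviation, so the typical range differs slightly from the slide values, which round S to 0.4.

```python
from math import sqrt

# Class midpoints (thousand km) and percentage of taxis in each class.
midpoints = [1.0, 1.4, 1.8, 2.2, 2.6, 3.0]
shares = [2, 15, 41, 33, 6, 3]

N = sum(shares)
mean = sum(x * n for x, n in zip(midpoints, shares)) / N
variance = sum((x - mean) ** 2 * n for x, n in zip(midpoints, shares)) / N
std = sqrt(variance)
cv_percent = std / mean * 100
typical_range = (mean - std, mean + std)

print("mean:", round(mean, 2))
print("variance:", round(variance, 4))
print("standard deviation:", round(std, 2))
print("coefficient of variation (%):", round(cv_percent, 1))
print("typical range:", tuple(round(v, 2) for v in typical_range))
```

The unrounded standard deviation is about 0.39, which gives a coefficient of variation just over 20% and a typical range of roughly 1.55 to 2.33 thousand km.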
Example (continued): measures of skewness

Mileage (thousand km)   n_i       x'_i   (x'_i - x̄)^2 n_i   (x'_i - x̄)^3 n_i
0.8 - 1.2               2         1.0    1.77               -1.66
1.2 - 1.6               15        1.4    4.37               -2.36
1.6 - 2.0               41        1.8    0.80               -0.11
2.0 - 2.4               33        2.2    2.23               0.58
2.4 - 2.8               6         2.6    2.61               1.72
2.8 - 3.2               3         3.0    3.37               3.57
Total                   N = 100          15.16              1.742

The analysis of skewness can be presented by calculating the 3rd moment about the mean:

$\mu_3 = \frac{\sum_i (x'_i - \bar{x})^3 n_i}{N} = \frac{1.742}{100} = 0.01742$

$\alpha_3 = \frac{\mu_3}{S^3(x)} = \frac{0.01742}{0.4^3} \approx 0.3$

$\alpha_3$ usually lies in the interval (-2; 2):
$\alpha_3 = 0$ - the distribution is symmetric
$\alpha_3 < 0$ - the distribution is skewed to the left (negatively skewed)
$\alpha_3 > 0$ - the distribution is skewed to the right (positively skewed)

Interpretation:
$\alpha_3 = 0.3$ - the distribution is slightly skewed to the right.
The mean is still a good measure of central tendency, although more taxis have a mileage below the mean than above it.
Example (continued): measures of concentration (kurtosis)

Mileage (thousand km)   n_i       x'_i   (x'_i - x̄)^2 n_i   (x'_i - x̄)^3 n_i   (x'_i - x̄)^4 n_i
0.8 - 1.2               2         1.0    1.77               -1.66              1.56
1.2 - 1.6               15        1.4    4.37               -2.36              1.28
1.6 - 2.0               41        1.8    0.80               -0.11              0.02
2.0 - 2.4               33        2.2    2.23               0.58               0.15
2.4 - 2.8               6         2.6    2.61               1.72               1.14
2.8 - 3.2               3         3.0    3.37               3.57               3.79
Total                   N = 100          15.16              1.742              7.929

The analysis of kurtosis can be presented by calculating the 4th moment about the mean:

$\mu_4 = \frac{\sum_i (x'_i - \bar{x})^4 n_i}{N} = \frac{7.929}{100} = 0.0793$

$\alpha_4 = \frac{\mu_4}{S^4(x)} = \frac{0.0793}{0.1516^2} \approx 3.45$

Here $S^4(x) = (S^2(x))^2 = 0.1516^2$ is taken from the unrounded variance; with S rounded to 0.4 the result would come out at about 3.1.

Interpretation:
$\alpha_4 = 3.45 > 3$ - the distribution is leptokurtic, i.e. more peaked than the normal one.
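The skewness and kurtosis measures for the taxi example can be recomputed in the same way as the variance. The sketch below uses illustrative names and the unrounded standard deviation throughout, which is why the kurtosis measure reproduces 3.45 rather than the value obtained with S rounded to 0.4.

```python
# Class midpoints (thousand km) and percentage of taxis in each class.
midpoints = [1.0, 1.4, 1.8, 2.2, 2.6, 3.0]
shares = [2, 15, 41, 33, 6, 3]

N = sum(shares)
mean = sum(x * n for x, n in zip(midpoints, shares)) / N


def central_moment(k):
    """kth moment about the mean of the grouped taxi data."""
    return sum((x - mean) ** k * n for x, n in zip(midpoints, shares)) / N


mu2 = central_moment(2)
mu3 = central_moment(3)
mu4 = central_moment(4)

alpha3 = mu3 / mu2 ** 1.5   # mu3 / S^3
alpha4 = mu4 / mu2 ** 2     # mu4 / S^4

shape = "leptokurtic" if alpha4 > 3 else "platykurtic" if alpha4 < 3 else "mesokurtic"
print("mu3 =", round(mu3, 5), "alpha3 =", round(alpha3, 2))
print("mu4 =", round(mu4, 5), "alpha4 =", round(alpha4, 2), "->", shape)
```

The skewness measure comes out at about 0.3 (slight right skew) and the kurtosis measure at about 3.45, above the normal-distribution value of 3.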
