Lecture 2b - Descriptive Statistics II


Descriptive Statistics II

NUMERICAL MEASURES
Numerical Measures in Statistics
• Measures of central tendency / location

• Measures of Distribution shape

• Measures of Dispersion / variation


Numerical Measures
• Measures that are computed from data within a sample are called
sample statistics

• Measures that are computed from data within a population are called
population parameters

• A sample statistic is referred to as a point estimator of the corresponding population parameter
Measures of Location
• Mean
• Median
• Mode
• Weighted Mean
• Geometric Mean
• Percentiles
• Quartiles
Mean
• The mean of a data set is the average of all the data values

• It provides a measure of central location

• The sample mean 𝑥̅ is the point estimator of the population mean 𝜇
Mean
The mean for ungrouped data is calculated by dividing the sum of
all values by the number of values in the dataset.
Population mean: \( \mu = \frac{\sum x_i}{N} \)
Sample mean: \( \bar{x} = \frac{\sum x_i}{n} \)

Where:
\( \sum x_i \) is the sum of all the data points
N is the population size
n is the sample size
𝜇 is the population mean
𝑥̅ is the sample mean


Example 1

• Find the mean marks obtained by students of ECON 1005.

10 20 36 92 95 40 50 56 60 70
92 88 80 70 72 70 36 40 36 40
92 40 50 50 56 60 70 60 60 88
Solution

\( \bar{x} = \frac{\sum x_i}{n} = \frac{1779}{30} = 59.3 \)

The average score for students in ECON 1005 was 59.3
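The calculation in Example 1 can be checked with a few lines of Python. This is only a minimal sketch: the function name `mean` and the variable names are chosen here for illustration, and the marks are copied from the example.

```python
# Marks for ECON 1005, copied from Example 1.
marks = [
    10, 20, 36, 92, 95, 40, 50, 56, 60, 70,
    92, 88, 80, 70, 72, 70, 36, 40, 36, 40,
    92, 40, 50, 50, 56, 60, 70, 60, 60, 88,
]


def mean(values):
    """Arithmetic mean: the sum of the values divided by the number of values."""
    if not values:
        raise ValueError("mean of an empty dataset is undefined")
    return sum(values) / len(values)


total = sum(marks)   # sum of all data points
n = len(marks)       # number of data points
print(total, n, round(mean(marks), 1))
```

The sum of the 30 marks is 1779, so the mean rounds to 59.3.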
Outliers

• The presence of outliers can often affect our calculation of the mean.

• An outlier is a value that is significantly higher or lower than the majority of values in the dataset

• An outlier can increase or decrease the value of the mean.


Example 2
• The following are the salaries of 10 persons within the finance department
at an Accounting firm.

$3500 $2850
$4200 $3680
$4550 $2730
$14550 $3200
$4050 $3990

Which observation is an outlier?


Find the mean salary with and without the outlier
Solution 1
• The outlier is $14,550
• Mean without the outlier:
• \( \bar{x} = \frac{\$3500 + \$4200 + \$4550 + \$4050 + \$2850 + \$3680 + \$2730 + \$3200 + \$3990}{9} \)

• \( \bar{x} = \frac{\$32750}{9} \)

• 𝑥̅ = $3,638.89

• The average salary without the outlier is $3,638.89


Solution 2
• Mean with the outlier:
• \( \bar{x} = \frac{\$3500 + \$4200 + \$4550 + \$14550 + \$4050 + \$2850 + \$3680 + \$2730 + \$3200 + \$3990}{10} \)

• \( \bar{x} = \frac{\$47300}{10} \)

• 𝑥̅ = $4,730

• The presence of the outlier therefore increases the value of the mean.
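The effect of the outlier in Example 2 can be reproduced directly. The sketch below assumes the outlier has already been identified by inspection ($14,550); it does not try to detect outliers automatically, and the variable names are illustrative.

```python
# Salaries of the 10 persons in the finance department (Example 2).
salaries = [3500, 4200, 4550, 14550, 4050, 2850, 3680, 2730, 3200, 3990]

# The outlier is picked by inspection, as in the example.
outlier = 14550
salaries_without_outlier = [s for s in salaries if s != outlier]

mean_with = sum(salaries) / len(salaries)
mean_without = sum(salaries_without_outlier) / len(salaries_without_outlier)

print(f"Mean with the outlier:    ${mean_with:,.2f}")
print(f"Mean without the outlier: ${mean_without:,.2f}")
```

A single extreme salary moves the mean by more than $1,000 in this data.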
Median
• The median of a dataset is the value in the middle when the data are
arranged in ascending order.

• The median gives the centre of a histogram with half the values on the
left of the median and half on the right

• The median is not influenced by outliers

• As a result, the median is preferred over the mean as a measure of location when the data contain outliers
Example 1:
• The following data are the cricket scores from the Barbados Royals in
the last 7 innings.
• 162 193 98 204 138 115 186

Step 1: Arrange the data points in ascending order

98 115 138 162 186 193 204

Step 2: The middle (4th of 7) value is the median, so Median = 162
Median

• Often there is an even number of data points, so there is no single “middle” value.

• In that case, the median is found by adding the two middle values and dividing by 2.
Median
• The following data are the rainfall (in mm) over a 12-month period
97 40 21 4 74 65 123 34 23 48 3 18
Arranging in ascending order:
3 4 18 21 23 34 40 48 65 74 97 123
Since there are 12 values, there are 2 values in the middle: 34 and 40

Median is found by \( \frac{34 + 40}{2} = \frac{74}{2} = 37 \)
Median
• Often a dataset is too large to identify the middle straight away
• In such a case, with the data arranged in ascending order, we use a general formula to find the midpoint as follows.
• For an odd number of cases: Median = \( \left(\frac{n+1}{2}\right) \)th term
• For an even number of cases: Median = \( \frac{\left(\frac{n}{2}\right)\text{th term} + \left(\frac{n}{2}+1\right)\text{th term}}{2} \)
• For example: If a dataset has 500 cases the median would be \( \frac{250\text{th term} + 251\text{st term}}{2} \)
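The position rules for odd and even n translate into a short function. This is a sketch rather than a reference implementation; the function name `median` is chosen here, positions are 1-based as in the formulas, and the two datasets are the cricket scores and rainfall figures from the median examples.

```python
def median(values):
    """Median by position: the (n+1)/2-th term for odd n, or the average
    of the n/2-th and (n/2 + 1)-th terms for even n (1-based positions)."""
    ordered = sorted(values)          # the data must be in ascending order
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    if n % 2 == 1:
        position = (n + 1) // 2       # single middle term
        return ordered[position - 1]  # convert to a 0-based index
    lower_position = n // 2           # first of the two middle terms
    upper_position = n // 2 + 1       # second of the two middle terms
    return (ordered[lower_position - 1] + ordered[upper_position - 1]) / 2


cricket_scores = [162, 193, 98, 204, 138, 115, 186]              # n = 7 (odd)
rainfall_mm = [97, 40, 21, 4, 74, 65, 123, 34, 23, 48, 3, 18]    # n = 12 (even)

print(median(cricket_scores))
print(median(rainfall_mm))
```

For a dataset of 500 cases the same function would average the 250th and 251st ordered values.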
Mode
• The mode of the dataset is the value that occurs the most often.
• The greatest frequency can occur in more than one value
• If a dataset has exactly two modes, the data are bimodal.
• If a dataset has more than two modes, the data are multimodal.
• Some data may have no mode, if each data point occurs only
once.
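A small sketch of these rules, assuming the convention stated above that a dataset in which every value occurs once has no mode. The function name `modes` and the colour list are hypothetical; the marks and rainfall figures are reused from the earlier examples.

```python
from collections import Counter


def modes(values):
    """Return every value that occurs with the greatest frequency.
    An empty list means there is no mode (each value occurs only once)."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []                      # no value repeats, so there is no mode
    return sorted(v for v, c in counts.items() if c == highest)


marks = [
    10, 20, 36, 92, 95, 40, 50, 56, 60, 70,
    92, 88, 80, 70, 72, 70, 36, 40, 36, 40,
    92, 40, 50, 50, 56, 60, 70, 60, 60, 88,
]
rainfall_mm = [97, 40, 21, 4, 74, 65, 123, 34, 23, 48, 3, 18]
colours = ["red", "blue", "red", "green", "blue"]   # hypothetical qualitative data

print(modes(marks))        # three values tie for the highest frequency
print(modes(rainfall_mm))  # every value occurs once
print(modes(colours))      # two values tie, so the data are bimodal
```

Because the function only counts occurrences, it works on qualitative labels as well as numbers.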
Mode Example
The data on right shows the monthly rents
of seventy apartments.
Mode, Median, Mean
• The mode can be calculated on both quantitative and qualitative
data. The mean and median can only be calculated on quantitative
data.
• A data set can have zero or more than one mode, but there can only
be one mean or median.
Trimmed Mean
• The trimmed mean is a measure for calculating the mean when
extreme values are present.
• To obtain the trimmed mean, we delete a given percentage of the
largest and smallest values.
• The mean is then calculated on the remaining values.
• For example, the 5% trimmed mean is obtained by deleting the
smallest 5% and the largest 5% of values and calculating the
mean on the remaining values.
• In order to calculate the trimmed mean, the values must be
arranged in ascending order before removing the largest and
smallest values.
Example
• The following data are test scores from a Spanish test of 19 students:
72, 99, 98, 76, 92, 45, 91, 91, 85, 90, 87, 88, 85, 85, 80, 79, 67, 66, 87
Find the 5% trimmed mean.

Since there are 19 test scores, 5% of 19 = 19(0.05) = 0.95


We round this number to 1 and take off the highest and lowest value
after ranking the data in ascending order.
45, 66, 67, 72, 76, 79, 80, 85, 85, 85, 87, 87, 88, 90, 91, 91, 92, 98, 99
We can then find the mean of the remaining values
Trimmed Mean - solution
• \( \bar{x} = \frac{66 + 67 + 72 + 76 + 79 + 80 + 85 + 85 + 85 + 87 + 87 + 88 + 90 + 91 + 91 + 92 + 98}{17} \)

• \( \bar{x} = \frac{1419}{17} \approx 83.5 \)

• Note we are now dividing by 17
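The trimming steps can be sketched as follows. The function name `trimmed_mean` is illustrative, and rounding the number of values to remove to the nearest whole number is an assumption that matches this example (0.95 becomes 1); other texts and software may round differently.

```python
def trimmed_mean(values, percent):
    """Mean after deleting `percent` % of the smallest and `percent` % of
    the largest values. The count to delete from each end is rounded to
    the nearest whole number (an assumption; note that Python's round()
    sends exact halves such as 0.5 to the nearest even integer)."""
    ordered = sorted(values)                      # ascending order first
    k = round(len(ordered) * percent / 100)       # values to drop per end
    kept = ordered[k:len(ordered) - k]
    if not kept:
        raise ValueError("trimming removed every value")
    return sum(kept) / len(kept)


spanish_scores = [72, 99, 98, 76, 92, 45, 91, 91, 85, 90,
                  87, 88, 85, 85, 80, 79, 67, 66, 87]

result = trimmed_mean(spanish_scores, 5)
print(round(result, 1))
```

With 19 scores and a 5% trim, one value is removed from each end and the mean is taken over the remaining 17.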


Weighted Mean

• In some instances, the mean is computed by giving each observation a weight that reflects its relative importance.

• The choice of weight depends on the application

• For example, the weights might be the number of credit hours earned for each grade, as in GPA.
Weighted Mean
• Weighted Mean: \( \bar{x} = \frac{\sum w_i x_i}{\sum w_i} \)

• Where
• \( x_i \) = value for observation i
• \( w_i \) = weight for observation i
• i.e. numerator = sum of weighted data values
• denominator = sum of weights

• If the data are from a population, then 𝜇 replaces 𝑥̅
Weighted Mean Example
• A contractor is going through the expenses for a house he just
completed. For the purpose of pricing future projects, he wants
to know the average wages per hour he paid the workers he
employed. Listed below, are the categories of workers he
employed, along with their respective wage and total hours
worked.
Worker Wage per hour Total Hours
Carpenter $21.60 520
Electrician $28.72 230
Labourer $11.80 410
Painter $19.75 270
Plumber $24.16 160
Weighted Mean
Worker        x       w      xw
Carpenter     21.60   520    11232.0
Electrician   28.72   230     6605.6
Labourer      11.80   410     4838.0
Painter       19.75   270     5332.5
Plumber       24.16   160     3865.6
Total                 1590   31873.7

\( \bar{x} = \frac{\sum w x}{\sum w} = \frac{31873.7}{1590} = 20.0464 \approx \$20.05 \)

The contractor paid an average of $20.05 per hour in wages on the house
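The contractor example can be reproduced with a short function. This is a sketch; the function name `weighted_mean` and the variable names are chosen here, with the hourly wage as the value x and the total hours as the weight w.

```python
def weighted_mean(values, weights):
    """Sum of (weight * value) divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Carpenter, Electrician, Labourer, Painter, Plumber
wage_per_hour = [21.60, 28.72, 11.80, 19.75, 24.16]   # x
total_hours = [520, 230, 410, 270, 160]               # w

average_wage = weighted_mean(wage_per_hour, total_hours)
print(f"${average_wage:.2f} per hour")
```

An unweighted average of the five wage rates would ignore how many hours each category actually worked, which is why the hours are used as weights.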
Geometric Mean
• The geometric mean is calculated by finding the nth root of the product of n values.

• It is often used in analyzing growth rates in financial data (where using the arithmetic mean will provide misleading results).

• It should be applied any time you want to determine the mean rate of change over several successive periods (be it years, quarters, weeks …)

• The formula is given by: \( \bar{x}_g = \sqrt[n]{(x_1)(x_2)\cdots(x_n)} = \left[(x_1)(x_2)\cdots(x_n)\right]^{1/n} \)
Geometric Mean
• Example: Rate of Return
Period   Return (%)   Growth Factor
1        -6.0         0.940
2        -8.0         0.920
3        -4.0         0.960
4         2.0         1.020
5         5.4         1.054

Geometric mean = \( \left[(0.940)(0.920)(0.960)(1.020)(1.054)\right]^{1/5} \)

= \( [0.89254]^{1/5} \)

= 0.97752

Average growth rate per period is (0.97752 - 1)(100) = -2.248%
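The rate-of-return example can be verified numerically. The sketch below assumes Python 3.8 or later for `math.prod`; the function name `geometric_mean` is chosen here, and each growth factor is computed as 1 + return/100, which matches the factors in the table.

```python
import math


def geometric_mean(factors):
    """nth root of the product of n positive growth factors."""
    if not factors or any(f <= 0 for f in factors):
        raise ValueError("growth factors must be positive and non-empty")
    return math.prod(factors) ** (1 / len(factors))


returns_percent = [-6.0, -8.0, -4.0, 2.0, 5.4]
growth_factors = [1 + r / 100 for r in returns_percent]   # 0.940, 0.920, ...

gm = geometric_mean(growth_factors)
average_growth_rate = (gm - 1) * 100

print(round(gm, 5))
print(round(average_growth_rate, 3), "% per period")
```

The result is slightly below 1, which corresponds to an average decline of about 2.248% per period.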


Percentiles
• A percentile provides information about how the data are spread over the interval from the smallest value to the largest value.

• Admission test scores for colleges and universities are frequently reported in terms of percentiles.

• The pth percentile of a dataset is a value such that at least p percent of the items take on this value or less and at least (100-p) percent of the items take on this value or more.
Percentiles
• In order to assess percentiles:

• Arrange the data in ascending order


• Compute the location(Lp) of the pth percentile as follows

• Lp = ( p / 100 ) (n + 1)
80th Percentiles
• Example: Apartment Rents
• Recall we had 70 monthly rents. We are trying to find the 80th
percentile of those values.
• Using the formula:
• Lp = ( 80 / 100 ) (70 + 1) = 56.8
• This indicates that the 80th percentile is the 56th value plus 0.8 times the
difference between the 56th and 57th values.
• 80th percentile = 635 + 0.8(649-635)
• = 646.2
80th Percentile
What does this value state?

• At least 80% of the items take on a value of 646.2 or less

AND

• At least 20% of the items take on a value of 646.2 or more
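The location rule Lp = (p/100)(n + 1) and the interpolation step can be sketched as below. The function names are illustrative, and clamping to the smallest or largest value when the location falls outside 1..n is an assumption added here. Because the 70 apartment rents are not listed in the text, the demonstration uses the 12 rainfall values from the median example.

```python
def percentile_location(p, n):
    """Location L_p = (p / 100) * (n + 1) of the pth percentile (1-based)."""
    return (p / 100) * (n + 1)


def percentile(values, p):
    """pth percentile using the (n + 1) location rule with linear
    interpolation between the two neighbouring ordered values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("percentile of an empty dataset is undefined")
    location = percentile_location(p, n)
    whole = int(location)              # position of the lower neighbour
    fraction = location - whole        # how far to move toward the upper one
    if whole < 1:                      # assumption: clamp below the 1st value
        return ordered[0]
    if whole >= n:                     # assumption: clamp above the nth value
        return ordered[-1]
    lower = ordered[whole - 1]
    upper = ordered[whole]
    return lower + fraction * (upper - lower)


rainfall_mm = [97, 40, 21, 4, 74, 65, 123, 34, 23, 48, 3, 18]

print(percentile_location(80, 70))   # location used in the apartment example
print(percentile(rainfall_mm, 50))   # 50th percentile equals the median
```

For the rainfall data the 50th percentile sits at location 6.5, halfway between the 6th and 7th ordered values, so it agrees with the median found earlier.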


Quartiles

• Quartiles are specific percentiles.

• First quartile = 25th percentile

• Second quartile = 50th percentile = median

• Third quartile = 75th percentile


Quartiles (75th Percentile)
• Example: Apartment Rents
• Lp = ( p / 100)(n + 1)
• = (75/100)(70+1)
• = 53.25 (the 53rd value plus 0.25 times the difference between the
54th and 53rd value)
• Third quartile = 625 + 0.25(625-625) = 625
Interquartile range
• The interquartile range is the difference between the 3rd quartile and the 1st quartile.

• Where the range gives you the spread of the whole dataset, the interquartile range gives the range of the middle half of the dataset.
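As a hedged illustration, Python's standard `statistics.quantiles` function (Python 3.8+) can compute the quartiles; its default "exclusive" method places the cut points with the same (n + 1) position rule used for percentiles in these notes. The rainfall values from the median example are reused because the apartment rents are not listed in the text.

```python
import statistics

rainfall_mm = [97, 40, 21, 4, 74, 65, 123, 34, 23, 48, 3, 18]

# n=4 asks for quartiles; "exclusive" interpolates at positions i * (n + 1) / 4.
q1, q2, q3 = statistics.quantiles(rainfall_mm, n=4, method="exclusive")

interquartile_range = q3 - q1                       # spread of the middle half
full_range = max(rainfall_mm) - min(rainfall_mm)    # spread of the whole dataset

print(q1, q2, q3)
print(interquartile_range, full_range)
```

The interquartile range here is much smaller than the full range because it ignores the lowest and highest quarters of the data.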
Measures of Distribution Shape
Relationship between Mode, Median, Mean
• In a symmetric histogram and frequency distribution with one peak, the values of the mode, median and mean are identical and lie at the centre of the distribution.
Relationship between Mode, Median, Mean
• For a histogram and frequency distribution curve that is skewed to the right, the value of the mean is the largest, the value of the mode is the smallest and the value of the median lies between the two.

• The mode is always at the peak point.

• The value of the mean is the largest due to the presence of outliers pulling the value of the mean to the right.
Relationship between Mode, Median, Mean
• For a histogram and frequency distribution curve that is skewed to the left, the value of the mean is the smallest, the value of the mode is the largest and the value of the median lies between the two.

• The value of the mean is the smallest due to the presence of outliers pulling the value of the mean to the left.
Measures of Dispersion / Variability
• It is often necessary to consider the measures of dispersion (variability) in addition to the measures of location

• For example, in choosing supplier A or B, we may want to consider not only the average delivery time for each but also the variability in delivery time for each.
Measures of Dispersion/Variability
• These measures include
• Range
• Variance
• Standard Deviation
• Coefficient of Variation
Range
• The range of a dataset is the difference between the largest and
smallest values

• Range = largest value – smallest value

• The range is the simplest measure of variability


Range Example
• The data below show the ages of 8 participants in a study.
37 19 31 29 21 26 33 36

Ordering the data from lowest to highest allows you to quickly identify the lowest and highest values
19 21 26 29 31 33 36 37

Range = 37 – 19 = 18
The range of ages within the study is 18 years.
Disadvantages of the Range

• The range is influenced by the presence of outliers.

• The range is NOT a good measure of variability when outliers are present.

• The range is a nonresistant measure of dispersion.

• The range only uses 2 values to measure dispersion.


Variance & Standard Deviation
• The standard deviation is a measure of variability / dispersion that utilizes all of the data.

• The standard deviation tells us how closely the values of the data are clustered around the
mean.

• The lower the standard deviation, the closer the values are clustered around the mean.

• The larger the standard deviation, the further the values are spread around the mean.

• The standard deviation is obtained by taking the positive square root of the variance.
Variance & Standard Deviation
• The variance calculated for population data is denoted as \( \sigma^2 \) (sigma squared)

• The variance calculated for sample data is denoted as \( s^2 \).

• The standard deviation calculated for population data is denoted as 𝜎.

• The standard deviation calculated for sample data is denoted as s.


Variance & Standard Deviation

• The variance is based on the difference between the value of each observation (\( x_i \)) and the mean

• This difference is known as the deviation about the mean (𝑥̅ for a sample and 𝜇 for the population)

• For a sample, the deviation is written as \( (x_i - \bar{x}) \); for a population, the deviation is written as \( (x_i - \mu) \)

• In order to compute the variance, these deviations are squared.


Variance & Standard Deviation Formulas
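• \( \sigma^2 = \frac{\sum (x_i - \mu)^2}{N} \) and \( s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} \)

• \( \sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}} \) and \( s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} \)

Where: \( \sigma^2 \) = population variance
\( s^2 \) = sample variance
𝜎 = population standard deviation
s = sample standard deviation

Any time we have a sample, the denominator is n - 1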


Variance & Standard Deviation Example
• The following data are the test scores of 10 students in an English exam. Find the sample standard deviation and sample variance.
Score (x)
20
40
60
60
75
80
70
65
70
90
Solution
• In order to find the standard deviation or variance, we must first find the mean of the data.
• \( \bar{x} = \frac{\sum x}{n} = \frac{630}{10} = 63 \)
Solution cont’d
x     𝑥̅     x - 𝑥̅    (x - 𝑥̅)²
20    63    -43      1849
40    63    -23       529
60    63     -3         9
60    63     -3         9
75    63     12       144
80    63     17       289
70    63      7        49
65    63      2         4
70    63      7        49
90    63     27       729
                   Σ = 3660
Solution cont’d

• Sample Variance is given by

• \( s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \)

• \( = \frac{3660}{9} \)
• = 406.67
Solution cont’d
• Sample Standard Deviation is given by

• \( s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \)

• \( = \sqrt{406.67} \)
• = 20.17
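The worked solution can be checked step by step in Python. This is a sketch with illustrative variable names; the manual calculation follows the table, and `statistics.variance` from the standard library (which also divides by n - 1) is used only as a cross-check.

```python
import math
import statistics

scores = [20, 40, 60, 60, 75, 80, 70, 65, 70, 90]

n = len(scores)
mean = sum(scores) / n                              # sample mean

deviations = [x - mean for x in scores]             # x - mean
squared_deviations = [d ** 2 for d in deviations]   # (x - mean)^2

# The raw deviations cancel out, so only their squares are summed.
sum_of_deviations = sum(deviations)
sum_of_squares = sum(squared_deviations)

sample_variance = sum_of_squares / (n - 1)          # denominator n - 1 for a sample
sample_sd = math.sqrt(sample_variance)

print(mean, sum_of_deviations, sum_of_squares)
print(round(sample_variance, 2), round(sample_sd, 2))
print(round(statistics.variance(scores), 2))        # library cross-check
```

Dividing the same sum of squares by n instead of n - 1 would give the population variance, which is the choice to make only when the ten scores are the entire population of interest.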
Variance & Standard Deviation
• The deviations are squared because the raw deviations from the mean always sum to zero, which would wrongly imply that there is no deviation from the mean.

• Having no deviation from the mean is not a true measure of the spread of the data.

x     𝑥̅     x - 𝑥̅
20    63    -43
40    63    -23
60    63     -3
60    63     -3
75    63     12
80    63     17
70    63      7
65    63      2
70    63      7
90    63     27
          Σ = 0
Coefficient of Variation (CV)
• The coefficient of variation is a measure of how large the standard deviation is in relation to the mean.

• Often we want to compare the variability of two different data sets that potentially have different units of measurement.

• We can do this using the coefficient of variation.

The coefficient of variation (CV)
• The coefficient of variation (CV) expresses the standard deviation as a percentage of the mean.

• For population data, \( CV = \frac{\sigma}{\mu} \times 100\% \)

• For sample data, \( CV = \frac{s}{\bar{x}} \times 100\% \)
Example
• The mean scores of two students, Sonia and Mark, in 5 subjects are 96 and 92, with standard deviations of 2.4 and 4.6 respectively. Who is the more consistent performer?
• We can answer this question using the coefficient of variation as follows:

CV for Sonia = \( \frac{2.4}{96} \times 100\% = 2.5\% \)
CV for Mark = \( \frac{4.6}{92} \times 100\% = 5\% \)

Since the CV for Sonia < CV for Mark, Sonia is more consistent.

A higher CV indicates a higher relative variation.
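The comparison of Sonia and Mark amounts to two divisions. A minimal sketch, with the function name `coefficient_of_variation` chosen here for illustration:

```python
def coefficient_of_variation(standard_deviation, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return standard_deviation / mean * 100


cv_sonia = coefficient_of_variation(2.4, 96)
cv_mark = coefficient_of_variation(4.6, 92)

# The lower CV indicates less relative variation, i.e. more consistency.
more_consistent = "Sonia" if cv_sonia < cv_mark else "Mark"
print(f"Sonia: {cv_sonia:.1f}%  Mark: {cv_mark:.1f}%  -> {more_consistent}")
```

Because the CV is a ratio, the same function can compare datasets measured in different units.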
Grouped Data
• Recall: in grouped data, the variable is recorded as intervals or ranges (1-10, 11-20, 21-30, etc.).

• Mean of grouped data

• Standard Deviation of Grouped Data

• Variance of grouped data


Mean of grouped data
• The mean for grouped population data is given by: \( \mu = \frac{\sum mf}{N} \)

• The mean for grouped sample data is given by: \( \bar{x} = \frac{\sum mf}{n} \)

Where m = midpoint of a class
f = frequency of a class
Example – grouped mean
• The table below gives the times in seconds it takes 21 athletes to
finish a race
Seconds Frequency
51-55 2
56-60 7
61-65 8
66-70 4

• Find the mean.


Solution
• Firstly, we must find the midpoint of each of the intervals, for example \( \frac{51 + 55}{2} = 53 \):

Seconds   Frequency   Midpoint
51-55     2           53
56-60     7           58
61-65     8           63
66-70     4           68
Solution Cont’d

• The mean for grouped sample data is given by: \( \bar{x} = \frac{\sum mf}{n} \), so each midpoint (m) is multiplied by the frequency (f) of its class and the products are summed.
Solution cont’d

Seconds   Midpoint (m)   Frequency (f)   mf
51-55     53             2               106
56-60     58             7               406
61-65     63             8               504
66-70     68             4               272
Total                    21              Σmf = 1288

• \( \bar{x} = \frac{\sum mf}{n} = \frac{1288}{21} = 61.3 \) seconds
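The grouped-mean calculation for the race times can be sketched as follows. The tuple layout (lower limit, upper limit, frequency) and the function name `grouped_mean` are choices made here for illustration.

```python
# (lower limit, upper limit, frequency) for each class of race times in seconds.
race_classes = [(51, 55, 2), (56, 60, 7), (61, 65, 8), (66, 70, 4)]


def grouped_mean(classes):
    """Mean of grouped data: sum of (midpoint * frequency) / sum of frequencies."""
    total_mf = 0
    total_f = 0
    for lower, upper, frequency in classes:
        midpoint = (lower + upper) / 2      # m = midpoint of the class
        total_mf += midpoint * frequency    # running sum of m * f
        total_f += frequency                # running sum of f (= n)
    return total_mf / total_f


midpoints = [(lower + upper) / 2 for lower, upper, _ in race_classes]
sum_mf = sum(m * f for m, (_, _, f) in zip(midpoints, race_classes))

print(midpoints, sum_mf)
print(round(grouped_mean(race_classes), 1), "seconds")
```

The result is an approximation of the true mean, since every observation in a class is treated as if it sat at the class midpoint.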
Variance & Standard deviation of grouped data
• The formulas for the variance of grouped data are given by:
• \( \sigma^2 = \frac{\sum f(m - \mu)^2}{N} \) and \( s^2 = \frac{\sum f(m - \bar{x})^2}{n - 1} \)

Where \( \sigma^2 \) = population variance
\( s^2 \) = sample variance
m = midpoint of a class
f = frequency of a class

The standard deviation is found by taking the square root of each.

Population standard deviation \( \sigma = \sqrt{\sigma^2} \); sample standard deviation \( s = \sqrt{s^2} \)
Example

• The data below show the time in minutes for 25 employees to get to
work

Commute time Frequency


0 to less than 10 4
10 to less than 20 9
20 to less than 30 6
30 to less than 40 4
40 to less than 50 2

Calculate the mean, standard deviation and variance.


Solution
• First find the midpoint (m) of each class and the product mf; the mean is then \( \mu = \frac{\sum mf}{N} \).
• For example, the first class has midpoint \( \frac{0 + 10}{2} = 5 \) and mf = 5 × 4 = 20.
Solution cont’d
Commute time         Frequency (f)   Midpoint (m)   mf    m - 𝜇    (m - 𝜇)²   f(m - 𝜇)²
0 to less than 10    4               5              20    -16.4    268.96     1075.84
10 to less than 20   9               15             135   -6.4     40.96      368.64
20 to less than 30   6               25             150   3.6      12.96      77.76
30 to less than 40   4               35             140   13.6     184.96     739.84
40 to less than 50   2               45             90    23.6     556.96     1113.92
Total                25                             535                       Σ = 3376

\( \mu = \frac{\sum mf}{N} = \frac{535}{25} = 21.4 \)

\( \sigma^2 = \frac{\sum f(m - \mu)^2}{N} = \frac{3376}{25} = 135.04 \)

Standard deviation = \( \sqrt{135.04} = 11.62 \)
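The commute-time solution can be reproduced with the sketch below. It follows the worked solution in treating the 25 employees as a population (dividing by N); for a sample, the denominator would be n - 1 instead. The variable names and tuple layout are illustrative.

```python
import math

# (lower limit, upper limit, frequency) for each commute-time class in minutes.
commute_classes = [(0, 10, 4), (10, 20, 9), (20, 30, 6), (30, 40, 4), (40, 50, 2)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in commute_classes]
frequencies = [f for _, _, f in commute_classes]

N = sum(frequencies)                                          # population size
sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))   # sum of m * f
mu = sum_mf / N                                               # grouped mean

# Sum of f * (m - mu)^2 over all classes.
sum_f_sq_dev = sum(f * (m - mu) ** 2 for m, f in zip(midpoints, frequencies))

population_variance = sum_f_sq_dev / N
population_sd = math.sqrt(population_variance)

print(N, sum_mf, mu)
print(round(sum_f_sq_dev, 2), round(population_variance, 2), round(population_sd, 2))
```

The printed values correspond to the totals row of the table and the three results beneath it.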


Additional readings
1. Chebyshev’s theorem

2. Empirical rule

3. Percentile rank
End of descriptive
statistics ii
