Descriptive Statistics - Numerical Measures

You might also like

Download as pdf or txt
Download as pdf or txt
You are on page 1of 91

Data Science for

Managerial Decisions
Concepts covered in the last two sessions?

2
Descriptive Statistics - Numerical Measures

◼ Measures of Location
◼ Measures of Variability

3
Measures of Location

Mean
If the measures are computed
Median for data from a sample,
Mode they are called sample statistics.
Percentiles
Quartiles If the measures are computed
for data from a population,
they are called population parameters.

A sample statistic is referred to


as the point estimator of the
corresponding population parameter.

4
Mean
◼ Perhaps the most important measure of location is the mean.
◼ The mean provides a measure of central location.
◼ The mean of a data set is the average of all the data values.
◼ The sample mean 𝑥ҧ is the point estimator of the population mean 𝜇.
Sample Mean 𝑥ҧ

Sum of the values


of the n observations

x i
x=
n
Number of observations
in the sample

6
Population Mean 𝜇

Sum of the values


of the N observations

x i
=
N
Number of observations
in the population

7
Sample Mean
Example: Apartment Rents
Seventy apartments were randomly sampled in a small college town. The monthly rent prices for these
apartments are listed below.

445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440
Sample Mean
Example: Apartment Rents
Seventy apartments were randomly sampled in a small college town. The monthly rent prices for these
apartments are listed below.
445 615 430 590 435 600 460 600 440 615
440 440 440 525 425 445 575 445 450 450
465 450 525 450 450 460 435 460 465 480
450 470 490 472 475 475 500 480 570 465
600 485 580 470 490 500 549 500 500 480
570 515 450 445 525 535 475 550 480 510
510 575 490 435 600 435 445 435 430 440

x=
 x i
=
34, 356
= 490.80
n 70
Mean

◼ Imagine ten folks are sitting on bar stools in a Seattle

◼ Each of them earns $35,000 a year, which makes the mean annual
income for the group as ………….

◼ Now Jack Dorsey walks into the bar; assume that Dorsey has annual
income of $1 billion

◼ Now the mean annual income of the bar patrons rises to ……….

10
Mean

◼ Imagine ten folks are sitting on bar stools in a Seattle

◼ Each of them earns $35,000 a year, which makes the mean annual
income for the group as $35,000.

◼ Now Jack Dorsey walks into the bar; assume that Dorsey has annual
income of $1 billion

◼ Now the mean annual income of the bar patrons rises to $90,940,909.

11
Mean

Income Freq Total


$35,000 10 $350,000
$1,000,000,000 1 $1,000,000,000
Avg $90,940,909

12
Limitations of using Mean

◼ If I were to describe the patrons of this bar as having an average annual


income of $91 million, the statement would be statistically correct yet
grossly misleading

◼ This isn’t a bar where multimillionaires hang out; it’s a bar where a bunch
of folks with relatively low incomes happen to be sitting next to Dorsey

◼ Because there has been explosive growth in incomes at the top end of the
distribution, the average income gets heavily skewed by the ultra rich

13
Quiz Time

◼ Calculation of mean is affected by __________ values

◼ If there are very high or very low values in that data that don’t look
like most of the data, then the mean is _______ ____________

◼ Fortunately, there are measures that do not suffer from this


shortcoming

14
Quiz Time

◼ Calculation of mean is affected by extreme values

◼ If there are very high or very low values in that data that don’t look
like most of the data, then the mean is not representative

◼ Fortunately, there are measures that do not suffer from this


shortcoming

15
Median
◼ A few extremely large values can inflate the mean
◼ Recall the previous example
◼ The median of a data set is the value in the middle when the data items are arranged in
ascending order
◼ Whenever a data set has extreme values, the median is the preferred measure of central
location
◼ The median is a robust estimator of central location but says nothing about how the
data on either side of its value is spread or dispersed
Median

For an odd number of observations:

26 18 27 12 14 27 19 7 observations

12 14 18 19 26 27 27 in ascending order

the median is the middle value.

Median = 19
Median
For an even number of observations:

26 18 27 12 14 27 30 19 8 observations

12 14 18 19 26 27 27 30 in ascending order

the median is the average of the middle two values.

Median = (19 + 26)/2 = 22.5


Median
◼ Example: Apartment Rents

Averaging the 35th and 36th data values:


Median = (475 + 475)/2 = 475
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Note: Data is in ascending order.
Mode
◼ The mode of a data set is the value that occurs with greatest frequency.
◼ The greatest frequency can occur at two or more different values.
◼ If the data have exactly two modes, the data are bimodal.
◼ If the data have more than two modes, the data are multimodal.
Exercise – Compute Mean , Median and Mode

21
Exercise – Compute Mean , Median and Mode

Sum 80 Sum 98
Count 16 Count 16
Mean 5 Mean 6.13
Median 5 Median 7
Mode 5 Mode 8

Sum 80
Count 16
Mean 5
Median 5
Mode 2,8

22
Percentiles
◼ A percentile provides information about how the data are spread over the interval
from the smallest value to the largest value.
◼ Admission test scores for colleges and universities are frequently reported in terms of
percentiles.
◼ The pth percentile of a data set is a value such that at least p percent of the items take
on this value or less and at least (100 - p) percent of the items take on this value or
more.
Percentiles
Arrange the data in ascending order.

Compute index i, the position of the pth percentile.


i = (p/100)n

If i is not an integer, round up. The pth percentile


is the value in the ith position.

If i is an integer, the pth percentile is the average


of the values in positions i and i+1.
80th Percentile
Example: Apartment Rents
i = (p/100)n = (80/100)70 = 56
Averaging the 56th and 57th data values:
80th Percentile = (535 + 549)/2 = 542
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Note: Data is in ascending order.
80th Percentile
Example: Apartment Rents

“At least 80% of the “At least 20% of the


items take on a items take on a
value of 542 or less.” value of 542 or more.”
56/70 = .8 or 80% 14/70 = .2 or 20%

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Quartiles

◼ Quartiles are specific percentiles.


◼ First Quartile = 25th Percentile
◼ Second Quartile = 50th Percentile = Median
◼ Third Quartile = 75th Percentile
Third Quartile
Example: Apartment Rents

Third quartile = 75th percentile


i = (p/100)n = (75/100)70 = 52.5 = 53
Third quartile = 525
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Note: Data is in ascending order.
Measures of Variability

Package of Package of
company 1 company 2
25 17
20 16
15 15
10 14
5 13
Avg: 15 Avg: 15
Measures of Variability
◼ Range
◼ Interquartile Range
◼ Variance
◼ Standard Deviation
◼ Coefficient of Variation
Range
◼ The range of a data set is the difference between the largest and smallest data values.
◼ It is the simplest measure of variability.
◼ It is very sensitive to the smallest and largest data values.
Range
Example: Apartment Rents

Range = largest value - smallest value


Range = 615 - 425 = 190
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Note: Data is in ascending order.
Interquartile Range

◼ The interquartile range of a data set is the difference between the third quartile and
the first quartile.
◼ It is the range for the middle 50% of the data.
◼ It overcomes the sensitivity to extreme data values.
Interquartile Range
Example: Apartment Rents

3rd Quartile (Q3) = 525


1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Note: Data is in ascending order.
Variance
◼ The variance is a measure of variability that utilizes all the data.
◼ It is based on the difference between the value of each observation (xi) and the
mean (𝑥ҧ for a sample,  for a population).
◼ The variance is useful in comparing the variability of two or more variables.
Variance - Example

Monthly Salary Sample Mean Deviation about Squared


(in US$) the Mean Deviation About
the Mean
3850 3940 -90 8100
3950 3940 +10 100
4050 3940 +110 12100
3880 3940 -60 3600
3970 3940 +30 900

◼ Sum of deviations about the mean ?


◼ Sum of squared deviations about the mean ?
Variance

The variance is computed as follows:

2  ( xi − x )
2
 ( xi −  )
2
s =  =
2
n −1 N

for a for a
sample population
Variance

The variance is computed as follows:

2  ( xi − x )
2
 ( xi −  )
2
s =  =
2
n −1 N

for a for a
sample population

◼ In most statistical applications, the data being analyzed for a sample.


◼ When we compute a sample variance, we are often interested in using it to estimate the population
variance.
◼ If the sum of the squared deviations about the sample mean is divided by (n-1), and not n, the resulting
sample variance provides an unbiased estimate of the population variance.
Standard Deviation
◼ The standard deviation of a data set is the positive square root of the variance.
◼ It is measured in the same units as the data, making it more easily interpreted than
the variance.

The standard deviation is computed as follows:

s = s2  = 2

for a for a
sample population
Coefficient of Variation
◼ The coefficient of variation indicates how large the standard deviation is relative
to the mean.

The coefficient of variation is computed as follows:


s   
 100  %  100  %
x   
for a for a
sample population
Coefficient of Variation (CoV)

◼ Stock A ◼ Stock B

◼ Five weeks of average ◼ Five weeks of average prices of


prices of stock A are 57, 68, stock B are 12, 17, 8, 15 and 13
64, 71 and 62
◼ Mean ? ◼ Mean ?
◼ Standard Deviation? ◼ Standard Deviation?
◼ CoV? ◼ CoV?

With standard deviation as a measure of risk, Stock ___ is more risky


With CoV as a measure of risk, Stock ____ is more risky

41
Coefficient of Variation (CoV)

◼ Stock A ◼ Stock B

◼ Five weeks of average ◼ Five weeks of average prices of


prices of stock A are 57, 68, stock B are 12, 17, 8, 15 and 13
64, 71 and 62
◼ Mean = 64.4 ◼ Mean =13
◼ Standard Deviation = 4.84 ◼ Standard Deviation = 3.03
◼ CoV = 7.5% ◼ CoV = 23.3%

With standard deviation as a measure of risk, Stock A is more risky


With CoV as a measure of risk, Stock B is more risky

42
Sample Variance, Standard Deviation,
And Coefficient of Variation
◼ Example: Apartment Rents

Variance
s 2
=
 (x i − x )2
= 2, 996.16
n−1

Standard Deviation
the standard
deviation is
s = s 2 = 2996.16 = 54.74
about 11%
of the mean
Coefficient of Variation

s   54.74 
  100  % =   100  % = 11.15%
x   490.80 
Summary of Last Session
✓ Mean
✓ Median
✓ Mode
✓ Percentiles
✓ Quartiles

✓ Range
✓ Interquartile Range
✓ Variance
✓ Standard Deviation
✓ Coefficient of Variation

44
Distribution Shape: Skewness

◼ An important measure of the shape of a distribution is called skewness.


◼ The formula for the skewness of sample data is

 xi − x 
3
n
Skewness =  
(n − 1)(n − 2)  s 

◼ Skewness can be easily computed using statistical software.


Distribution Shape: Skewness

Symmetric (not skewed)


◼ Skewness is zero.
◼ Mean and median are equal.

Skewness = 0
.35
.30
Relative Frequency

.25
.20
.15
.10
.05
0
Distribution Shape: Skewness

Moderately Skewed Left


◼ Skewness is negative.
◼ Mean will usually be less than the median.

Skewness = − .31
.35
.30
Relative Frequency

.25
.20
.15
.10
.05
0
Distribution Shape: Skewness
Moderately Skewed Right
◼ Skewness is positive.
◼ Mean will usually be more than the median.

Skewness = .31
.35
.30
Relative Frequency

.25
.20
.15
.10
.05
0
Distribution Shape: Skewness
Highly Skewed Right
◼ Skewness is positive (often above 1.0).
◼ Mean will usually be more than the median.

.35
Skewness = 1.25
.30
Relative Frequency

.25
.20
.15
.10
.05
0
Distribution Shape: Skewness
Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in a college town. The
monthly rent prices for the apartments are listed below in ascending order.

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Distribution Shape: Skewness
Example: Apartment Rents

.35 Skewness = .92


.30
Relative Frequency

.25

.20
.15

.10
.05
0
z-Scores
◼ The z-score is often called the standardized value.
◼ It denotes the number of standard deviations a data value xi is away from the mean.

xi − x
zi =
s

𝑥𝑖 = 𝑥ҧ + 𝑠𝑧𝑖
z-Scores
◼ An observation’s z-score is a measure of the relative location of the observation
in a data set.
◼ A data value less than the sample mean will have a z-score less than zero.
◼ A data value greater than the sample mean will have a z-score greater than zero.
◼ A data value equal to the sample mean will have a z-score of zero.
z-Scores
Example: Apartment Rents
◼ z-Score of Smallest Value (425)
xi − x 425 − 490.80
z= = = − 1.20
s 54.74

Standardized Values for Apartment Rents


-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
Detecting Outliers
◼ An outlier is an unusually small or unusually large value in a data set.
◼ A data value with a z-score less than -3 or greater than +3 might be considered
an outlier.
◼ It might be:
➢ an incorrectly recorded data value
➢ a correctly recorded data value that belongs to the data set
Detecting Outliers
Example: Apartment Rents
◼ The most extreme z-scores are -1.20 and 2.27
◼ Using |z| > 3 as the criterion for an outlier, there are no outliers in this data set.

Standardized Values for Apartment Rents


-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
Detecting Outliers

57
Five-Number Summaries
and Box Plots

Summary statistics and easy-to-draw graphs can be


used to quickly summarize large quantities of data.

Two tools that accomplish this are five-number


summaries and box plots.
Five-Number Summaries

1 Smallest Value

2 First Quartile

3 Median

4 Third Quartile

5 Largest Value
Five-Number Summaries
Example: Apartment Rents

Lowest Value = 425 First Quartile = 445


Median = 475
Third Quartile = 525 Largest Value = 615
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Box Plot

A box plot is a graphical summary of data that is based on a five-number summary.

A key to the development of a box plot is the computation of the median and
the quartiles Q1 and Q3.

Box plots provide another way to identify outliers.


Box Plot
Example: Apartment Rents
◼ A box is drawn with its ends located at the first and third quartiles.
◼ A vertical line is drawn in the box at the location of the median (second quartile).

400 425 450 475 500 525 550 575 600 625

Q1 = 445 Q3 = 525
Q2 = 475
Box Plot
◼ Limits are located (not drawn) using the interquartile range (IQR).
◼ Data outside these limits are considered outliers.
◼ The locations of each outlier is shown with the symbol * .
Box Plot
Example: Apartment Rents

The lower limit is located 1.5(IQR) below Q1.

Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325

The upper limit is located 1.5(IQR) above Q3.

Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645

There are no outliers (values less than 325 or greater than 645)
in the apartment rent data.
Box Plot
Example: Apartment Rents
◼ Whiskers (dashed lines) are drawn from the ends of the box to the
smallest and largest data values inside the limits.

400 425 450 475 500 525 550 575 600 625

Smallest value Largest value


inside limits = 425 inside limits = 615
To draw box plot in Excel:
https://support.microsoft.com/en-us/office/create-a-box-plot-10204530-8cdf-40fe-a711-2eb9785e510f
66
Chebyshev’s Theorem
At least (1 - 1/z2) of the items in any data set will be within z standard deviations of
the mean, where z is any value greater than 1.

Chebyshev’s theorem requires z > 1, but z need not be an integer.


Chebyshev’s Theorem

At least ….. of the data values must be


within z = 2 standard deviations of the mean.

At least ….. of the data values must be


within z = 3 standard deviations of the mean.

At least ….. of the data values must be


within z = 4 standard deviations of the mean.
Chebyshev’s Theorem

At least 75% of the data values must be


within z = 2 standard deviations of the mean.

At least 89% of the data values must be


within z = 3 standard deviations of the mean.

At least 94% of the data values must be


within z = 4 standard deviations of the mean.
Chebyshev’s Theorem
Example: Apartment Rents
Let z = 1.5 with x = 490.80 and s = 54.74

At least (1 − 1/(1.5)2) = 1 − 0.44 = 0.56 or 56%


of the rent values must be between
x - z(s) = 490.80 − 1.5(54.74) = 409
and
x + z(s) = 490.80 + 1.5(54.74) = 573

(Actually, 86% of the rent values are between 409 and 573.)
Empirical Rule
When the data are believed to approximate a bell-shaped distribution …

The empirical rule can be used to determine the percentage of data values that must be within a
specified number of standard deviations of the mean.

The empirical rule is based on the normal distribution, which will be covered later
Empirical Rule

For data having a bell-shaped distribution:


68.26% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.

95.44% of the values of a normal random variable


are within +/- 2 standard deviations of its mean.

99.72% of the values of a normal random variable


are within +/- 3 standard deviations of its mean.
Empirical Rule

99.72%
95.44%
68.26%


x
 – 3  – 1  + 1  + 3
 – 2  + 2
Measures of Association
Between Two Variables

Thus far we have examined numerical methods used to summarize the data for
one variable at a time.

Often a manager or decision maker is interested in the relationship between two


variables.

Two descriptive measures of the relationship between two variables are


covariance and correlation coefficient.
Example: Stereo & Sound Equipment Store

Store’s manager wants to determine relationship between the number of weekend


television commercials shown and the sales at the store during the following week.

75
Example: Stereo & Sound Equipment Store

76
Covariance

The covariance is a measure of the linear association between two variables.

Positive values indicate a positive relationship.

Negative values indicate a negative relationship.


Covariance

The covariance is computed as follows:

 ( xi − x )( yi − y ) for
sxy =
n −1 samples

 ( xi −  x )( yi −  y ) for
 xy = populations
N
Covariance
Example: Stereo & Sound Equipment Store

Store’s manager wants to determine relationship between the number of weekend


television commercials shown and the sales at the store during the following week.

79
Covariance
Example: Stereo & Sound Equipment Store

80
Covariance
Example: Stereo & Sound Equipment Store

➢ Scatter diagram with a vertical dashed line at 𝑥ҧ = 3


➢ Scatter diagram with a horizontal dashed line at 𝑦ത = 51
➢ The lines divide the scatter diagram into four quadrants
81
Covariance
Example: Stereo & Sound Equipment Store

➢ Points in quadrant I correspond to xi greater than 𝑥ҧ and yi greater than 𝑦,


ത points in quadrant II correspond to xi less
than 𝑥ҧ and yi greater than 𝑦,
ത and so on.
➢ Thus, the value of (xi -𝑥)(y
ҧ i -𝑦)
ത must be positive for points in quadrant I, negative for points in quadrant II, positive
for points in quadrant III, and negative for points in quadrant IV.
82
Covariance
Example: Stereo & Sound Equipment Store

➢ If the value of Sxy is +, the points with the greatest influence on Sxy must be in Q1 and Q3. Hence a positive value
for Sxy indicates a positive linear association between x and y.
➢ If the value of Sxy is -, the points with the greatest influence on Sxy must be in Q2 and Q4. Hence a negative value
for Sxy indicates a negative linear association between x and y.
83
Correlation Coefficient

Correlation is a measure of linear association and not necessarily causation.

Just because two variables are highly correlated, it does not mean that one variable
is the cause of the other.
◼ Correlation is NOT causation

85
Correlation Coefficient

The correlation coefficient is computed as follows:


sxy  xy
rxy =  xy =
sx s y  x y

for for
samples populations
Correlation Coefficient

The correlation coefficient can take on values between -1 and +1.

Values near -1 indicate a strong negative linear relationship.

Values near +1 indicate a strong positive linear relationship.

The closer the correlation is to zero, the weaker the relationship.


Covariance and Correlation Coefficient
Example: Stereo & Sound Equipment Store

Sample Covariance

sxy =
 ( x − x )( y
i i − y)
=
99
= 11
n−1 9
Sample Correlation Coefficient
sxy 11
rxy = = = 0.93
sx sy (1.49)(7.93)
Correlation Coefficient ?

89
Correlation Coefficient ?

➢ Correlation coefficient measures only the strength of linear relationship between two variables.
➢ It is possible for the correlation coefficient to be near zero, suggesting no linear relationship, when the
relationship between the two variables is non-linear.
90
Thank You !!!

You might also like