Descriptive Statistics - Numerical Measures


Statistics
Descriptive Statistics Numerical
Measures
Contents
Measures of location
Measures of variability
Measures of distribution shape, relative
location, and detecting outliers
Exploratory data analysis
Measures of association between two variables
The weighted mean and working with grouped
data

STATISTICS in PRACTICE
Small Fry Design is a toy and accessory
company that designs and imports products
for infants.
Cash flow management is one of the
most critical activities in the day-to-
day operation of this company.
STATISTICS in PRACTICE
A critical factor in cash flow management is
the analysis and control of accounts receivable,
which the company monitors by measuring the
average age and dollar value of outstanding invoices.
The company set the following goals: the
average age for outstanding invoices should not
exceed 45 days, and the dollar value of
invoices more than 60 days old should not
exceed 5% of the dollar value of all accounts
receivable.

Measures of Location
If the measures are computed for data from a
sample, they are called sample statistics.
If the measures are computed for data from a
population, they are called population
parameters.
A sample statistic is referred to as the point
estimator of the corresponding population
parameter.
Mean
The mean of a data set is the average of
all the data values.
Population mean: $\mu$
Sample mean: $\bar{x}$
The sample mean $\bar{x}$ is the point estimator
of the population mean $\mu$.
Sample Mean
$$\bar{x} = \frac{\text{Sum of the values of the } n \text{ observations}}{\text{Number of observations in the sample}} = \frac{\sum x_i}{n}$$
Population Mean
$$\mu = \frac{\text{Sum of the values of the } N \text{ observations}}{\text{Number of observations in the population}} = \frac{\sum x_i}{N}$$

Sample Mean
Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
Data



Sample Mean
The mean monthly starting salary
$$\bar{x} = \frac{\sum x_i}{n} = \frac{x_1 + x_2 + \cdots + x_{12}}{12} = \frac{2850 + 2950 + \cdots + 2880}{12} = \frac{35{,}280}{12} = 2940$$
Sample Mean
Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in
a small college town. The monthly rent prices for
these apartments are listed below in ascending order.

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Sample Mean
$$\bar{x} = \frac{\sum x_i}{n} = \frac{34{,}356}{70} = 490.80$$
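As a quick check of the sample mean formula, the following minimal Python sketch sums the 70 rents listed above and divides by n. The variable names (`rents`, `x_bar`) are illustrative choices, not part of the slides.

```python
# Monthly rents ($) for the 70 sampled apartments, copied from the slide.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

n = len(rents)            # number of observations in the sample
total = sum(rents)        # sum of the values of the n observations
x_bar = total / n         # sample mean

print(n, total, round(x_bar, 2))  # 70 34356 490.8
```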
Median
Whenever a data set has extreme values, the
median is the preferred measure of central
location.
The median of a data set is the value in the
middle when the data items are arranged in
ascending order.
Median
A few extremely large incomes or property values
can inflate the mean.
The median is the measure of location most often
reported for annual income and property value
data.
Median
Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
We first arrange the data in ascending order.


2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Because n = 12 is even, we identify the middle
two values: 2890 and 2920.
$$\text{Median} = \frac{2890 + 2920}{2} = 2905$$
Median
For an odd number of observations:
26 18 27 12 14 27 19 (7 observations)
12 14 18 19 26 27 27 (in ascending order)
The median is the middle value.
Median = 19
Median
For an even number of observations:
26 18 27 12 14 27 30 19 (8 observations)
12 14 18 19 26 27 27 30 (in ascending order)
The median is the average of the middle two values.
Median = (19 + 26)/2 = 22.5
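The odd and even cases above can be sketched in a few lines of Python. This is a minimal illustration with a hypothetical helper name (`median`); the standard library's `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if n is even."""
    ordered = sorted(values)          # arrange the data in ascending order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the middle two

odd_sample = [26, 18, 27, 12, 14, 27, 19]
even_sample = [26, 18, 27, 12, 14, 27, 30, 19]
salaries = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]

print(median(odd_sample))   # 19
print(median(even_sample))  # 22.5
print(median(salaries))     # 2905.0
```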
Mode
Example: Frequency Distribution of 50
Soft Drink Purchases
Soft Drink      Frequency
Coke Classic    19
Diet Coke        8
Dr. Pepper       5
Pepsi-Cola      13
Sprite           5
Total           50
The mode, or most frequently purchased
soft drink, is Coke Classic.
Mode
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
450 occurred most frequently (7 times)
Mode = 450
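Both mode examples can be reproduced with a short Python sketch. For the soft drink data the frequencies are already tabulated, so the mode is the category with the largest count; for the rents, `collections.Counter` does the counting. Variable names are illustrative.

```python
from collections import Counter

# Frequency table from the soft drink slide.
soft_drinks = {"Coke Classic": 19, "Diet Coke": 8, "Dr. Pepper": 5,
               "Pepsi-Cola": 13, "Sprite": 5}
drink_mode = max(soft_drinks, key=soft_drinks.get)   # category with the highest frequency

# Monthly rents ($) for the 70 sampled apartments.
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]
rent_mode, rent_mode_count = Counter(rents).most_common(1)[0]

print(drink_mode)                   # Coke Classic
print(rent_mode, rent_mode_count)   # 450 7
```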
Percentiles
A percentile provides information about how the
data are spread over the interval from the smallest
value to the largest value.
Admission test scores for colleges and
universities are frequently reported in terms
of percentiles.
The pth percentile of a data set is a value such
that at least p percent of the items take on this
value or less and at least (100 - p) percent of
the items take on this value or more.
Percentiles
Example: Monthly Starting Salaries for a
sample of 12 Business School Graduates
Let us determine the 85th percentile for the
starting salary data




Percentiles
Step 1. Arrange the data in ascending order.
2710 2755 2850 2880 2880 2890 2920 2940
2950 3050 3130 3325
Step 2. Compute the index i.
$$i = \left(\frac{p}{100}\right) n = \left(\frac{85}{100}\right) 12 = 10.2$$
Step 3.
Because i is not an integer, round up. The
position of the 85th percentile is the next
integer greater than 10.2, the 11th position.
The value in the 11th position is 3130, so the
85th percentile is 3130.
Percentiles
Arrange the data in ascending order.
Compute index i, the position of the pth
percentile.
i = (p/100)n
If i is not an integer, round up. The pth
percentile is the value in the ith position.
If i is an integer, the pth percentile is the average
of the values in positions i and i + 1.
90th Percentile
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
i = (p/100)n = (90/100)70 = 63
Averaging the 63rd and 64th data values:
90th Percentile = (580 + 590)/2 = 585
90th Percentile
At least 90% of the items take on a value
of 585 or less: 63/70 = .9 or 90%.
At least 10% of the items take on a value
of 585 or more: 7/70 = .1 or 10%.
Quartiles
Quartiles are specific percentiles.
First Quartile = 25th Percentile
Second Quartile = 50th Percentile = Median
Third Quartile = 75th Percentile
Quartiles
Third Quartile
Third quartile = 75th percentile
i = (p/100)n = (75/100)70 = 52.5, which rounds up to 53
Third quartile = value in the 53rd position = 525
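The percentile procedure from the slides (compute i = (p/100)n, round up if i is not an integer, otherwise average positions i and i + 1) can be sketched as follows. The function name `percentile` is hypothetical, and the sketch assumes an integer p strictly between 0 and 100 so that the index can be computed with exact integer arithmetic. Note that statistical packages often use different interpolation rules and may return slightly different values.

```python
def percentile(values, p):
    """pth percentile using the slide's index rule; p is an integer with 0 < p < 100."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)               # Step 1: ascending order
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)  # Step 2: i = (p/100)n, kept exact
    if remainder:                          # i is not an integer: round up
        return ordered[whole]              # 1-based position whole + 1
    # i is an integer: average the values in positions i and i + 1
    return (ordered[whole - 1] + ordered[whole]) / 2

salaries = [2710, 2755, 2850, 2880, 2880, 2890, 2920, 2940, 2950, 3050, 3130, 3325]
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

print(percentile(salaries, 85))  # 3130
print(percentile(rents, 90))     # 585.0
print(percentile(rents, 25), percentile(rents, 75))  # 445 525
```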
Measures of Variability
It is often desirable to consider measures of
variability (dispersion), as well as measures
of location.
For example, in choosing supplier A or supplier
B we might consider not only the average
delivery time for each, but also the variability
in delivery time for each.
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range
The range of a data set is the difference between
the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest
data values.
Range
Range = largest value - smallest value
Range = 615 - 425 = 190
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Interquartile Range
The interquartile range of a data set is the
difference between the third quartile and the
first quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data
values.
Interquartile Range
3rd Quartile (Q3) = 525
1st Quartile (Q1) = 445
Interquartile Range = Q3 - Q1 = 525 - 445 = 80
Variance
The variance is a measure of variability that
utilizes all the data.
It is based on the difference between the value of
each observation ($x_i$) and the mean ($\bar{x}$ for
a sample, $\mu$ for a population).
Variance
The variance is the average of the squared
differences between each data value and the mean.
The variance is computed as follows:
for a sample
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$
for a population
$$\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$$
Standard Deviation
The standard deviation of a data set is the
positive square root of the variance.
It is measured in the same units as the data,
making it more easily interpreted than the
variance.
Standard Deviation
The standard deviation is computed as follows:
for a sample
$$s = \sqrt{s^2}$$
for a population
$$\sigma = \sqrt{\sigma^2}$$
Coefficient of Variation
The coefficient of variation indicates how large
the standard deviation is in relation to the mean.
The coefficient of variation is computed as follows:
for a sample
$$\left(\frac{s}{\bar{x}} \times 100\right)\%$$
for a population
$$\left(\frac{\sigma}{\mu} \times 100\right)\%$$
Variance, Standard Deviation,
And Coefficient of Variation
Example: Apartment Rents
Variance
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = 2996.16$$
Standard Deviation
$$s = \sqrt{s^2} = \sqrt{2996.16} = 54.74$$
Variance, Standard Deviation,
And Coefficient of Variation
Coefficient of Variation
$$\left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{54.74}{490.80} \times 100\right)\% = 11.15\%$$
The standard deviation is about 11% of
the mean.
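The measures of variability for the apartment rents can be recomputed with Python's standard `statistics` module, whose `variance` and `stdev` functions use the sample (n - 1) divisor. This is a sketch for checking the arithmetic; the variable names are illustrative.

```python
from statistics import mean, stdev, variance

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

data_range = max(rents) - min(rents)   # largest value minus smallest value
x_bar = mean(rents)                    # sample mean
s2 = variance(rents)                   # sample variance, divisor n - 1
s = stdev(rents)                       # sample standard deviation
cv = s / x_bar * 100                   # coefficient of variation, in percent

print(data_range, round(s2, 2), round(s, 2), round(cv, 2))
```

Rounded to two decimals, the standard deviation and coefficient of variation should agree with the 54.74 and 11.15% shown on the slides.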
Measures of Distribution Shape,
Relative Location, and Detecting
Outliers
Distribution Shape
z-Scores
Chebyshev's Theorem
Empirical Rule
Detecting Outliers
Distribution Shape: Skewness
An important measure of the shape of a
distribution is called skewness.
The formula for computing skewness for a
data set is somewhat complex.
Note: The formula for the skewness of
sample data is
$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$$
Distribution Shape: Skewness

Skewness can be easily computed using
statistical software.
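For readers without a statistics package at hand, the sample skewness formula n/((n - 1)(n - 2)) Σ((x_i - x̄)/s)³ can be sketched directly in Python. The function name `sample_skewness` is hypothetical; the data are the apartment rents used throughout these slides, for which the slides report a skewness of about .92.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Sample skewness: n / ((n-1)(n-2)) times the sum of cubed z-scores."""
    n = len(values)
    x_bar = mean(values)
    s = stdev(values)                  # sample standard deviation (divisor n - 1)
    cubed = sum(((x - x_bar) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * cubed

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

print(sample_skewness([1, 2, 3, 4, 5]))   # symmetric data: 0.0
print(round(sample_skewness(rents), 2))   # positive: the rents are skewed right
```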
Distribution Shape: Skewness
Symmetric (not skewed)
Skewness is zero.
Mean and median are equal.
Distribution Shape: Skewness
[Figure: relative frequency histogram of a symmetric distribution; Skewness = 0]
Distribution Shape: Skewness
Moderately Skewed Left
Skewness is negative.
Mean will usually be less than the median.
Distribution Shape: Skewness
[Figure: relative frequency histogram of a distribution moderately skewed left; Skewness = -.31]
Distribution Shape: Skewness
Moderately Skewed Right
Skewness is positive.
Mean will usually be more than the median.
[Figure: relative frequency histogram of a distribution moderately skewed right; Skewness = .31]
Distribution Shape: Skewness
Highly Skewed Right
Skewness is positive (often above 1.0).
Mean will usually be more than the median.
Distribution Shape: Skewness
[Figure: relative frequency histogram of a distribution highly skewed right; Skewness = 1.25]
Distribution Shape: Skewness
Example: Apartment Rents
Seventy efficiency apartments were randomly sampled in
a small college town. The monthly rent prices for
these apartments are listed below in ascending order.

425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Distribution Shape: Skewness
[Figure: relative frequency histogram of the apartment rents; Skewness = .92]
z-Scores
The z-score is often called the standardized value.
It denotes the number of standard deviations a
data value $x_i$ is from the mean.
$$z_i = \frac{x_i - \bar{x}}{s}$$
z-Scores
An observation's z-score is a measure of the
relative location of the observation in a data
set.
A data value less than the sample mean will
have a z-score less than zero.
A data value greater than the sample mean
will have a z-score greater than zero.
A data value equal to the sample mean will
have a z-score of zero.
z-Scores
z-Score of Smallest Value (425)
$$z = \frac{x_i - \bar{x}}{s} = \frac{425 - 490.80}{54.74} = -1.20$$
z-Scores
Standardized Values for Apartment Rents
-1.20 -1.11 -1.11 -1.02 -1.02 -1.02 -1.02 -1.02 -0.93 -0.93
-0.93 -0.93 -0.93 -0.84 -0.84 -0.84 -0.84 -0.84 -0.75 -0.75
-0.75 -0.75 -0.75 -0.75 -0.75 -0.56 -0.56 -0.56 -0.47 -0.47
-0.47 -0.38 -0.38 -0.34 -0.29 -0.29 -0.29 -0.20 -0.20 -0.20
-0.20 -0.11 -0.01 -0.01 -0.01 0.17 0.17 0.17 0.17 0.35
0.35 0.44 0.62 0.62 0.62 0.81 1.06 1.08 1.45 1.45
1.54 1.54 1.63 1.81 1.99 1.99 1.99 1.99 2.27 2.27
Chebyshev's Theorem
At least $(1 - 1/z^2)$ of the items in any data set will
be within z standard deviations of the mean,
where z is any value greater than 1.
Chebyshev's Theorem
For example:
At least 75% of the data values must be
within z = 2 standard deviations of the mean.
At least 89% of the data values must be
within z = 3 standard deviations of the mean.
At least 94% of the data values must be
within z = 4 standard deviations of the mean.
Chebyshev's Theorem
Let z = 1.5 with $\bar{x}$ = 490.80 and s = 54.74.
At least $(1 - 1/(1.5)^2)$ = 1 - 0.44 = 0.56 or 56%
of the rent values must be between
$\bar{x}$ - z(s) = 490.80 - 1.5(54.74) = 409
and
$\bar{x}$ + z(s) = 490.80 + 1.5(54.74) = 573
(Actually, 86% of the rent values are between
409 and 573.)
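The Chebyshev bound for the rent data, and the share of rents that actually fall inside the interval, can be checked with a short Python sketch. The helper name `chebyshev_bound` is hypothetical; the mean and standard deviation are recomputed from the data rather than taken from the rounded slide values.

```python
from statistics import mean, stdev

def chebyshev_bound(z):
    """Minimum proportion of any data set within z standard deviations of the mean (z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

z = 1.5
x_bar, s = mean(rents), stdev(rents)
low, high = x_bar - z * s, x_bar + z * s           # interval of +/- 1.5 standard deviations
inside = sum(low <= x <= high for x in rents)      # rents that fall in the interval
share = inside / len(rents)

print(round(chebyshev_bound(z), 2))   # 0.56
print(round(low), round(high))        # 409 573
print(inside, round(share, 2))        # 60 0.86
```

The observed share (86%) is well above the guaranteed minimum (56%), which is typical: the theorem gives a lower bound that holds for any distribution.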

Empirical Rule
For data having a bell-shaped distribution:
68.26% of the values of a normal random variable
are within +/- 1 standard deviation of its mean.
95.44% of the values of a normal random variable
are within +/- 2 standard deviations of its mean.
99.72% of the values of a normal random variable
are within +/- 3 standard deviations of its mean.
Empirical Rule
[Figure: bell-shaped curve centered at the mean, with 68.26% of the values within 1 standard deviation, 95.44% within 2, and 99.72% within 3]
Detecting Outliers
An outlier is an unusually small or unusually
large value in a data set.
A data value with a z-score less than -3 or
greater than +3 might be considered an outlier.
Detecting Outliers
It might be:
an incorrectly recorded data value
a data value that was incorrectly included
in the data set
a correctly recorded data value that belongs
in the data set
Detecting Outliers
The most extreme z-scores are -1.20 and 2.27
Using |z| > 3 as the criterion for an outlier,
there are no outliers in this data set.
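The z-score screen for outliers can be sketched in Python as follows: standardize every rent with z = (x - x̄)/s and flag any value with |z| > 3. Variable names are illustrative.

```python
from statistics import mean, stdev

rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

x_bar, s = mean(rents), stdev(rents)
z_scores = [(x - x_bar) / s for x in rents]        # standardized values

# Flag data values more than 3 standard deviations from the mean.
outliers = [x for x, z in zip(rents, z_scores) if abs(z) > 3]

print(round(min(z_scores), 2), round(max(z_scores), 2))  # -1.2 2.27
print(outliers)                                          # []
```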
Exploratory Data Analysis
Five-Number Summary
Box Plot
Five-Number Summary
1. Smallest Value
2. First Quartile
3. Median
4. Third Quartile
5. Largest Value
Five-Number Summary
Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
2710 2755 2850 2880 2880 2890 2920 2940 2950 3050 3130 3325
Q1 = 2865, Q2 = 2905 (Median), Q3 = 3000
Five-Number Summary
425 430 430 435 435 435 435 435 440 440
440 440 440 445 445 445 445 445 450 450
450 450 450 450 450 460 460 460 465 465
465 470 470 472 475 475 475 480 480 480
480 485 490 490 490 500 500 500 500 510
510 515 525 525 525 535 549 550 570 570
575 575 580 590 600 600 600 600 615 615
Lowest Value = 425 First Quartile = 445
Median = 475
Third Quartile = 525 Largest Value = 615
Box Plot
A box is drawn with its ends located at
the first and third quartiles.
A vertical line is drawn in the box at the
location of the median (second quartile).
[Figure: box plot on an axis from 375 to 625, with Q1 = 445, Q2 = 475, and Q3 = 525]
Box Plot
Limits are located (not drawn) using the
interquartile range (IQR).
The lower limit is located 1.5(IQR) below Q1.
The upper limit is located 1.5(IQR) above Q3.
Data outside these limits are considered
outliers.
The location of each outlier is shown with
the symbol *.
Box Plot
With IQR = Q3 - Q1 = 525 - 445 = 80:
Lower Limit: Q1 - 1.5(IQR) = 445 - 1.5(80) = 325
Upper Limit: Q3 + 1.5(IQR) = 525 + 1.5(80) = 645
There are no outliers (values less than 325 or
greater than 645) in the apartment rent data.
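The box plot limits follow directly from the quartiles. The sketch below takes Q1 = 445 and Q3 = 525 from the five-number summary of the rents (so IQR = 525 - 445 = 80), computes the limits, and then looks for outliers and whisker ends. Variable names are illustrative.

```python
rents = [
    425, 430, 430, 435, 435, 435, 435, 435, 440, 440,
    440, 440, 440, 445, 445, 445, 445, 445, 450, 450,
    450, 450, 450, 450, 450, 460, 460, 460, 465, 465,
    465, 470, 470, 472, 475, 475, 475, 480, 480, 480,
    480, 485, 490, 490, 490, 500, 500, 500, 500, 510,
    510, 515, 525, 525, 525, 535, 549, 550, 570, 570,
    575, 575, 580, 590, 600, 600, 600, 600, 615, 615,
]

q1, q3 = 445, 525                 # quartiles from the five-number summary
iqr = q3 - q1                     # interquartile range
lower_limit = q1 - 1.5 * iqr      # 1.5(IQR) below Q1
upper_limit = q3 + 1.5 * iqr      # 1.5(IQR) above Q3

outliers = [x for x in rents if x < lower_limit or x > upper_limit]
inside = [x for x in rents if lower_limit <= x <= upper_limit]
whiskers = (min(inside), max(inside))   # whisker ends: extreme values inside the limits

print(iqr, lower_limit, upper_limit)    # 80 325.0 645.0
print(outliers, whiskers)               # [] (425, 615)
```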
Box Plot
Whiskers (dashed lines) are drawn from the
ends of the box to the smallest and largest
data values inside the limits.
[Figure: box plot on an axis from 375 to 625, with whiskers extending to the smallest value inside the limits (425) and the largest value inside the limits (615)]
Box Plot
Example: Monthly Starting Salaries for a
Sample of 12 Business School Graduates
Box Plot


Measures of Association
Between Two Variables
Covariance
Correlation Coefficient
Covariance
The covariance is a measure of the linear
association between two variables.
Positive values indicate a positive relationship.
Negative values indicate a negative relationship.
Covariance
The covariance is computed as follows:
for samples
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
for populations
$$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$$
Covariance
Example: Sample Data for the Stereo and
Sound Equipment Store
Data

Covariance
Scatter Diagram for the Stereo and Sound
Equipment Store





Sample Covariance
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{99}{9} = 11$$
Covariance
Partitioned Scatter Diagram for the Stereo
and Sound Equipment Store


Correlation Coefficient
The coefficient can take on values between
-1 and +1.
Values near -1 indicate a strong negative linear
relationship.
Values near +1 indicate a strong positive linear
relationship.
Correlation Coefficient
The correlation coefficient is computed as follows:
for samples
$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$
for populations
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
Correlation Coefficient
Just because two variables are highly correlated,
it does not mean that one variable is the cause of
the other.
Correlation is a measure of linear association
and not necessarily causation.
Correlation Coefficient
A golfer is interested in investigating
the relationship, if any, between driving
distance and 18-hole score.
Average Driving      Average
Distance (yds.)      18-Hole Score
277.6                69
259.5                71
269.1                70
267.0                70
255.6                71
272.9                69
Covariance and Correlation
Coefficient
x        y     (x_i - x̄)   (y_i - ȳ)   (x_i - x̄)(y_i - ȳ)
277.6    69     10.65       -1.0        -10.65
259.5    71     -7.45        1.0         -7.45
269.1    70      2.15        0            0
267.0    70      0.05        0            0
255.6    71    -11.35        1.0        -11.35
272.9    69      5.95       -1.0         -5.95
Average    267.0    70.0          Total  -35.40
Std. Dev.  8.2192   .8944
Covariance and Correlation
Coefficient
Sample Covariance
$$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1} = \frac{-35.40}{6 - 1} = -7.08$$
Sample Correlation Coefficient
$$r_{xy} = \frac{s_{xy}}{s_x s_y} = \frac{-7.08}{(8.2192)(.8944)} = -.9631$$
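The golf example can be recomputed from the raw data with a short Python sketch that applies the sample covariance and correlation formulas directly. Variable names are illustrative; note that the exact mean driving distance is 266.95, which the slide table rounds to 267.0.

```python
from math import sqrt

driving = [277.6, 259.5, 269.1, 267.0, 255.6, 272.9]   # x: average driving distance (yds.)
score = [69, 71, 70, 70, 71, 69]                        # y: average 18-hole score

n = len(driving)
x_bar = sum(driving) / n
y_bar = sum(score) / n

# Sample covariance: sum of cross-products of deviations, divided by n - 1.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(driving, score)) / (n - 1)

# Sample standard deviations of x and y.
s_x = sqrt(sum((x - x_bar) ** 2 for x in driving) / (n - 1))
s_y = sqrt(sum((y - y_bar) ** 2 for y in score) / (n - 1))

# Sample correlation coefficient.
r_xy = s_xy / (s_x * s_y)

print(round(s_xy, 2), round(s_x, 4), round(s_y, 4), round(r_xy, 4))
# -7.08 8.2192 0.8944 -0.9631
```

The strongly negative coefficient indicates that, in this sample, longer average drives go with lower (better) scores; as the slides stress, this is association and not proof of causation.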
The Weighted Mean and
Working with Grouped Data
Weighted Mean
Mean for Grouped Data
Variance for Grouped Data
Standard Deviation for Grouped Data
Weighted Mean
When the mean is computed by giving each
data value a weight that reflects its importance,
it is referred to as a weighted mean.
In the computation of a grade point average
(GPA), the weights are the number of credit
hours earned for each grade.
When data values vary in importance, the
analyst must choose the weight that best
reflects the importance of each value.
Weighted Mean
$$\bar{x} = \frac{\sum w_i x_i}{\sum w_i}$$
where:
$x_i$ = value of observation i
$w_i$ = weight for observation i
Grouped Data
The weighted mean computation can be
used to obtain approximations of the mean,
variance, and standard deviation for the
grouped data.
To compute the weighted mean, we treat the
midpoint of each class as though it were the
mean of all items in the class.
Grouped Data
We compute a weighted mean of the class
midpoints using the class frequencies as
weights.


Similarly, in computing the variance and
standard deviation, the class frequencies are
used as weights.
Mean for Grouped Data
Sample Data
$$\bar{x} = \frac{\sum f_i M_i}{n}$$
Population Data
$$\mu = \frac{\sum f_i M_i}{N}$$
where:
$f_i$ = frequency of class i
$M_i$ = midpoint of class i
Given below is the previous sample of
monthly rents for 70 efficiency apartments,
presented here as grouped
data in the form of a
frequency distribution.
Rent ($) Frequency
420-439 8
440-459 17
460-479 12
480-499 8
500-519 7
520-539 4
540-559 2
560-579 4
580-599 2
600-619 6
Sample Mean for Grouped
Data
Rent ($)    f_i    M_i      f_i M_i
420-439      8     429.5     3436.0
440-459     17     449.5     7641.5
460-479     12     469.5     5634.0
480-499      8     489.5     3916.0
500-519      7     509.5     3566.5
520-539      4     529.5     2118.0
540-559      2     549.5     1099.0
560-579      4     569.5     2278.0
580-599      2     589.5     1179.0
600-619      6     609.5     3657.0
Total       70              34,525.0
Sample Mean for Grouped
Data
$$\bar{x} = \frac{\sum f_i M_i}{n} = \frac{34{,}525}{70} = 493.21$$
This approximation differs by $2.41
from the actual sample mean of $490.80.
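The grouped-data mean is a weighted mean of class midpoints with the class frequencies as weights. A minimal Python sketch, assuming each class is stored as a hypothetical (lower limit, upper limit, frequency) tuple:

```python
# (lower limit, upper limit, frequency) for each rent class from the frequency distribution.
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8), (500, 519, 7),
    (520, 539, 4), (540, 559, 2), (560, 579, 4), (580, 599, 2), (600, 619, 6),
]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]   # M_i: class midpoints
freqs = [f for _, _, f in classes]                      # f_i: class frequencies

n = sum(freqs)                                          # total number of observations
weighted_sum = sum(f * m for f, m in zip(freqs, midpoints))
grouped_mean = weighted_sum / n                         # weighted mean of the midpoints

print(n, weighted_sum, round(grouped_mean, 2))          # 70 34525.0 493.21
```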
Variance for Grouped Data
For sample data
$$s^2 = \frac{\sum f_i (M_i - \bar{x})^2}{n - 1}$$
For population data
$$\sigma^2 = \frac{\sum f_i (M_i - \mu)^2}{N}$$
Rent ($)    f_i    M_i      M_i - x̄    (M_i - x̄)^2    f_i (M_i - x̄)^2
420-439      8     429.5    -63.7       4058.96        32471.71
440-459     17     449.5    -43.7       1910.56        32479.59
460-479     12     469.5    -23.7        562.16         6745.97
480-499      8     489.5     -3.7         13.76          110.11
500-519      7     509.5     16.3        265.36         1857.55
520-539      4     529.5     36.3       1316.96         5267.86
540-559      2     549.5     56.3       3168.56         6337.13
560-579      4     569.5     76.3       5820.16        23280.66
580-599      2     589.5     96.3       9271.76        18543.53
600-619      6     609.5    116.3      13523.36        81140.18
Total       70                                        208234.29
Sample Variance for
Grouped Data
Sample Variance
$$s^2 = \frac{208{,}234.29}{70 - 1} = 3{,}017.89$$
Sample Standard Deviation
$$s = \sqrt{3{,}017.89} = 54.94$$
This approximation differs by only $.20
from the actual standard deviation of $54.74.
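The grouped-data variance and standard deviation can be checked the same way, weighting each squared deviation of a class midpoint by its class frequency. This is a sketch under the same assumption as the slides, namely that each class midpoint stands in for every value in the class; names are illustrative.

```python
from math import sqrt

# (lower limit, upper limit, frequency) for each rent class.
classes = [
    (420, 439, 8), (440, 459, 17), (460, 479, 12), (480, 499, 8), (500, 519, 7),
    (520, 539, 4), (540, 559, 2), (560, 579, 4), (580, 599, 2), (600, 619, 6),
]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]
n = sum(freqs)

grouped_mean = sum(f * m for f, m in zip(freqs, midpoints)) / n

# Sum of frequency-weighted squared deviations of the midpoints from the grouped mean.
weighted_ss = sum(f * (m - grouped_mean) ** 2 for f, m in zip(freqs, midpoints))

grouped_variance = weighted_ss / (n - 1)    # sample variance for grouped data
grouped_std = sqrt(grouped_variance)        # sample standard deviation for grouped data

print(round(weighted_ss, 2), round(grouped_variance, 2), round(grouped_std, 2))
```

Rounded to two decimals, the results should match the 3,017.89 and 54.94 shown above; tiny differences in the sum of squares can arise because the slides round the mean to 493.21 before squaring.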
