
Integrating shape, location and variability
By integrating the three main features, we can learn not only about the
overall characteristics of the given data but also about particular
observations within the data.

To that end, we introduce the z-score (or simply z), the Empirical Rule and
Chebyshev’s Theorem, all of which build on these three main aspects of
data.
z-scores
The z-score (or simply z) measures the location of a particular value
in the data relative to the mean, in units of the standard deviation:
z = (value – mean) / standard deviation

Example: Suppose that in a statistics mid-term exam (40 students in the
class), the average score was 75 with a standard deviation of 10. Let’s
consider the scores of two particular students in the class.
Student #1: scores 65
Student #2: scores 95
Z-scores
z (of student 1) = (65 – 75) / 10 = -1 (is NEGATIVE!)
z (of student 2) = (95 – 75)/ 10 = 2 (is POSITIVE!)

Interpretation: Student 1’s score is 1 standard deviation below the
class average. Student 2’s score is 2 standard deviations above the
class average.
So, any student’s standing relative to the class average can be
measured using the z-score.
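The calculation for the two students can be sketched in a few lines of Python. This is only an illustration: the function name `z_score` and its parameter names are our own choices, not part of the course material.

```python
def z_score(value, mean, std_dev):
    """Return how many standard deviations `value` lies from `mean`."""
    if std_dev <= 0:
        raise ValueError("standard deviation must be positive")
    return (value - mean) / std_dev


# Mid-term example from the slides: mean 75, standard deviation 10.
class_mean, class_sd = 75, 10
print(z_score(65, class_mean, class_sd))  # -1.0 (below the class average)
print(z_score(95, class_mean, class_sd))  # 2.0 (above the class average)
```

A negative result means the score is below the mean; a positive result means it is above.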
Empirical Rule
When the shape of the distribution/histogram is bell-shaped (mound-shaped),
symmetric and unimodal, in short when it mimics what is known in
statistics as the “Normal distribution”:

- Approximately 68% of the values are within one standard deviation of
the mean.
- Approximately 95% of the values are within two standard deviations of
the mean.
- Almost all (about 99.7%) of the values are within three standard
deviations of the mean.
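[Figure: The Empirical Rule. Bell-shaped frequency curve of X, with the horizontal axis marked at -3s, -2s, -1s, Mean, +1s, +2s, +3s; the central bands cover 68%, 95% and almost 100% of the values.]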


Empirical Rule
Back to the mid-term exam scores: Suppose that the mid-term exam
scores approximately follow a normal distribution. Recall that the mean
is 75 and the standard deviation is 10. Then, using the Empirical Rule, the
following statements can be made:

1. approximately 68% of scores are between 65 and 85
2. approximately 95% of scores are between 55 and 95
3. almost all scores are between 45 and 100 (the rule gives 45 to 105,
but the maximum score is 100)
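The three intervals above are simply mean ± k standard deviations for k = 1, 2, 3. A minimal Python sketch follows. The names `EMPIRICAL_COVERAGE` and `empirical_interval` are hypothetical, and the 99.7% figure is the usual textbook value for "almost all".

```python
# Approximate share of values within k standard deviations of the mean
# for a bell-shaped distribution (99.7 is the usual textbook figure).
EMPIRICAL_COVERAGE = {1: 68.0, 2: 95.0, 3: 99.7}


def empirical_interval(mean, std_dev, k, lower_cap=None, upper_cap=None):
    """Return (low, high) = mean -/+ k * std_dev, optionally capped."""
    low, high = mean - k * std_dev, mean + k * std_dev
    if lower_cap is not None:
        low = max(low, lower_cap)
    if upper_cap is not None:
        high = min(high, upper_cap)
    return low, high


# Mid-term example: mean 75, standard deviation 10, maximum score 100.
for k, percent in EMPIRICAL_COVERAGE.items():
    low, high = empirical_interval(75, 10, k, upper_cap=100)
    print(f"about {percent}% of scores are between {low} and {high}")
```

The `upper_cap` argument reflects the note that no score can exceed 100.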
Empirical Rule
For the above mid-term scores example, use the Empirical rule to answer the
following:
1. What percentage of students have scored more than 95?
>> Note that the distribution is symmetric. Furthermore, we know
that approximately 95% have scored between 55 and 95.
Therefore, approximately 2.5% (half of 5%) of students have scored more
than 95 in the exam.

2. What percentage of students have scored less than 65?
>> Note that approximately 68% of students have scored between
65 and 85. Therefore, approximately 16% (which is half of 32%)
have scored below 65.
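Both answers use the same symmetry argument: whatever lies outside the central band is split equally between the two tails. A small Python sketch of that arithmetic, with a function name of our own choosing:

```python
def one_tail_percent(central_percent):
    """Percentage in ONE tail of a symmetric distribution, given the
    percentage that lies in the central band."""
    if not 0 <= central_percent <= 100:
        raise ValueError("central_percent must be between 0 and 100")
    return (100.0 - central_percent) / 2.0


print(one_tail_percent(95))  # 2.5  -> scored more than 95
print(one_tail_percent(68))  # 16.0 -> scored less than 65
```

This only applies when the distribution is symmetric, as the Empirical Rule assumes.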
Outlier detection using the Empirical Rule

Sometimes a data set will have one or more observations with unusually large or
unusually small values.
These extreme values are often called outliers. These are values that are out of
the ordinary or are not typical!
How can we tell whether or not a value is an outlier? This is a difficult question
to answer even for an experienced statistician.
But, if one is willing to assume the data is normal or close to normal, any value
with a z-score below -3 or above 3 may be considered an outlier. The
justification is that, per the Empirical Rule, almost all values should be within 3
standard deviations of the mean. A value outside that range is unusual enough
to be flagged as a possible outlier.
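The z-score rule for outliers can be sketched in Python as below. The data set is made up for illustration, the function name `flag_outliers` is hypothetical, and using the sample standard deviation (`statistics.stdev`) is an assumption, since the slides do not say which version to use.

```python
from statistics import mean, stdev


def flag_outliers(values, threshold=3.0):
    """Return the values whose z-score is below -threshold or above
    +threshold. Only sensible if the data is close to normal."""
    m = mean(values)
    s = stdev(values)  # sample standard deviation (an assumption)
    return [v for v in values if abs((v - m) / s) > threshold]


# Hypothetical data: 21 ordinary values plus one extreme value.
scores = [70, 72, 74, 75, 76, 78, 80] * 3 + [150]
print(flag_outliers(scores))  # [150]
```

Note that an extreme value inflates both the mean and the standard deviation, so in very small data sets no value may reach a z-score of 3.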
What if the data is not normal? We will come back to this later.
Can we use the Empirical Rule for the following
distributions? The answer is NO, NO, NO.
[Figure: three example distributions that are not bell-shaped]
Do not use the Empirical Rule when:
- The distribution is skewed
- The distribution has more than one mode


Chebyshev’s Theorem
When the data does not resemble a normal distribution, we use what is known as
Chebyshev’s Theorem in place of the Empirical Rule.

Chebyshev’s Theorem is not as precise as the Empirical Rule because it
applies to any data distribution. Of course, as with anything in real life, we
pay a price for being more general. The price is a loss of precision in
the predictions that we make.
Chebyshev’s Theorem
Recall that z denotes the z-score of a value in a given data set.

For z > 1, Chebyshev’s Theorem states that
at least (1 – 1/z²) × 100% of the data values must be
within z standard deviations of the mean.
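Chebyshev’s bound, 1 – 1/z², is easy to evaluate for any z greater than 1. The following Python sketch shows it; the function name `chebyshev_lower_bound` is our own, not part of the course material.

```python
def chebyshev_lower_bound(z):
    """Minimum proportion of values within z standard deviations of the
    mean, for ANY distribution (Chebyshev's Theorem, z > 1)."""
    if z <= 1:
        raise ValueError("Chebyshev's bound requires z > 1")
    return 1.0 - 1.0 / z**2


print(chebyshev_lower_bound(2))  # 0.75
print(chebyshev_lower_bound(4))  # 0.9375
```

The result is a guaranteed minimum ("at least"), not an estimate of the actual proportion.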
Chebyshev’s Theorem
Example: The average mid-term exam score is 75 with a standard deviation
of 5. Close inspection of the histogram of the mid-term scores indicates
that the distribution is quite skewed to the right.
Required: Use Chebyshev’s Theorem to predict the following:
1. What percentage of class scores are within 2 standard deviations
of the mean?
2. What percentage of class scores are within 4 standard
deviations of the mean?
Chebyshev’s Theorem
1. Note that z = 2. Now, substitute z = 2 into the formula given in the
theorem,
1 – 1/2² = 1 – 1/4 = 0.75
Therefore, at least 75% of scores are within 2 standard deviations of the
mean.
2. Note that z = 4. Now, substitute z = 4 into the formula given in the
theorem,
1 – 1/4² = 1 – 1/16 = 0.9375 ≈ 0.94
Therefore, at least 94% of scores are within 4 standard deviations of the
mean.
Another Example on Chebyshev’s Theorem
The midterm test scores for 100 students in a college business statistics course
had a mean of 70 and a standard deviation of 5.
Required: Predict the percentage of students who scored between 58 and 82.
Solution:
z (for 58) = (58 – 70) / 5 = -2.4
z (for 82) = (82 – 70) / 5 = 2.4
Therefore, the needed z for using Chebyshev’s Theorem is z = 2.4. Now,
substituting z = 2.4 in the formula,
1 – 1/(2.4)² = 1 – 1/5.76 = 0.826
Therefore, at least 82.6% of students have test scores between 58 and 82.
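The two steps of this solution (convert the interval endpoints to z-scores, then apply 1 – 1/z²) can be combined in one Python sketch. The function name is hypothetical, and the check that the interval is symmetric about the mean is our own addition, since the bound as stated here needs the same z on both sides.

```python
import math


def chebyshev_for_interval(low, high, mean, std_dev):
    """Chebyshev lower bound for the share of values in [low, high],
    where the interval is centred on the mean."""
    z_low = (low - mean) / std_dev
    z_high = (high - mean) / std_dev
    if not math.isclose(-z_low, z_high):
        raise ValueError("interval must be symmetric about the mean")
    if z_high <= 1:
        raise ValueError("Chebyshev's bound requires z > 1")
    return 1.0 - 1.0 / z_high**2


# Business statistics example: mean 70, standard deviation 5.
bound = chebyshev_for_interval(58, 82, 70, 5)
print(round(bound, 3))  # 0.826
```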
Empirical or Chebyshev’s?

Solve the following two problems using the appropriate rule.

This time, you need to figure out which rule to apply.
Please solve: the main learning goal is to
decide which rule to apply!
Suppose the results of a survey showed that the
distribution of the price of single family homes in Orange
County has a mean of $400,000 and a standard deviation of
$25,000.
1. If the distribution of home prices can be considered mound-shaped
and symmetric, what percentage of single family homes in Orange
County will have a price of between $350,000 and $450,000?

2. If the shape of the distribution of home prices is highly skewed to the
right, what percentage of single family homes in Orange County will
have a price of more than $325,000 and less than $475,000?
Solution
The givens are: mean = $400,000 and standard deviation = $25,000.

1. Because the distribution is mound-shaped and symmetric, we use the
Empirical Rule.
z (for 350,000) = (350,000 – 400,000) / 25,000 = -2
z (for 450,000) = (450,000 – 400,000) / 25,000 = 2
Therefore, we are asked to predict the percentage of observations
that are within 2 standard deviations of the mean. Per the Empirical Rule,
the answer is approximately 95%.
Solution, continued…
2. Because the distribution is highly skewed, we use Chebyshev’s Theorem.
z (for 325,000) = (325,000 – 400,000) / 25,000 = -3
z (for 475,000) = (475,000 – 400,000) / 25,000 = 3

The z required to use Chebyshev’s Theorem is simply z = 3. Plugging
that into the formula,
1 – 1/3² = 1 – 1/9 = 0.889
Therefore, at least 88.9% of the observations are within 3 standard
deviations of the mean.
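The decision between the two rules can be summarised in a short Python sketch. Everything here is illustrative: the function name `predicted_share`, the `bell_shaped` flag and the lookup table are our own, and 99.7% is the usual textbook figure for three standard deviations.

```python
import math

# Empirical Rule coverage for bell-shaped data (k = 1, 2, 3 only).
EMPIRICAL = {1: 68.0, 2: 95.0, 3: 99.7}


def predicted_share(low, high, mean, std_dev, bell_shaped):
    """Return (qualifier, percent) for the share of values in [low, high].

    Uses the Empirical Rule if the distribution is bell-shaped,
    otherwise Chebyshev's Theorem. The interval must be centred on the mean.
    """
    if not math.isclose(mean - low, high - mean):
        raise ValueError("interval must be symmetric about the mean")
    k = (high - mean) / std_dev
    if bell_shaped:
        k_int = round(k)
        if not math.isclose(k, k_int) or k_int not in EMPIRICAL:
            raise ValueError("Empirical Rule covers only k = 1, 2 or 3")
        return ("approximately", EMPIRICAL[k_int])
    if k <= 1:
        raise ValueError("Chebyshev's bound requires k > 1")
    return ("at least", round(100 * (1 - 1 / k**2), 1))


# Home price example: mean $400,000, standard deviation $25,000.
print(predicted_share(350_000, 450_000, 400_000, 25_000, bell_shaped=True))
# ('approximately', 95.0)
print(predicted_share(325_000, 475_000, 400_000, 25_000, bell_shaped=False))
# ('at least', 88.9)
```

The qualifier matters: the Empirical Rule gives an approximate percentage, while Chebyshev’s Theorem gives only a guaranteed minimum.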
Please solve: the learning goal is to decide
which rule to apply!
Results of a survey revealed that the distribution of the amount of the
monthly utility bill of a 3-bedroom house using gas or electric energy had
a mean of $97 and a standard deviation of $12.

1. If the distribution of monthly bills can be considered mound-shaped
and symmetric, what percentage of homes will have a monthly bill of
more than $85 and less than $109?
2. If nothing is known about the shape of the distribution of monthly
bills, what percentage of homes will have a monthly bill between $61
and $133?
3. If the distribution of monthly bills can be considered mound-shaped
and symmetric, what percentage of homes will have a monthly bill of
more than $121?
Solution

This is left for you to do.


VIDEOS are available
Exercise videos are available in Course Titanium.
1. Shape, median and z-score
2. Chebyshev’s Rule and Empirical Rule
Please watch these videos immediately after this
section has been covered in class.
