
Chebyshev's Theorem and the Empirical Rule

Suppose we ask 1000 people what their age is. If this is a representative sample, there will be very few people who are 1 or 2 years old, just as there will not be many 95-year-olds. Most will have an age somewhere in their 30s or 40s. A list of the number of people of a certain age may look like this:
Age                 0    1    2    3    ..   30   31   ..   60   61   ..   80   81
Number of people    1    2    3    8    ..   45   48   ..   32   30   ..   6    3

Next, we can turn this list into a scatter diagram with age on the horizontal axis and the number of people of a certain age on the vertical axis.

From the statistical point of view a scatter diagram may have one of two shapes. It may be shaped like, or at least look approximately like, a 'bell curve':

[Figure: bell curve; horizontal axis: Age, vertical axis: No. of people]

A 'bell curve' is perfectly symmetrical with respect to a vertical line through its peak and is sometimes called a "Gauss curve" or a "normal curve". The second shape a scatter diagram may have is anything but a normal curve, as in the next drawing:

We can do a lot of good statistics with the normal curve, but virtually none with any other curve. Let us assume that we have recorded the 1000 ages and computed the mean and standard deviation of these ages. Assuming the mean age came out as 40 years and the standard deviation as 6 years, we can make the following predictions.

Chebyshev's Theorem
In the case of a scatter diagram that seems to be anything but a normal curve, all we can go by is Chebyshev's theorem. This very important but rarely used theorem holds for any distribution, so it still applies when the distribution is non-normal. It says the following about the individual data, which in this case are the ages:

At least 75% of all the ages will lie in the range $\bar{X} \pm 2s$. In our case this means that at least 75% of the people will have an age in the range $40 \pm 2 \cdot 6 = 40 \pm 12$ years, which simplifies to a range of 28 to 52 years.

At least 88.9% of all the ages will lie in the range $\bar{X} \pm 3s$. In our case this means that at least 88.9% of the people will have an age in the range $40 \pm 3 \cdot 6 = 40 \pm 18$ years, which simplifies to a range of 22 to 58 years.

At least 93.75% of all the ages will lie in the range $\bar{X} \pm 4s$. In our case this means that at least 93.75% of the people will have an age in the range $40 \pm 4 \cdot 6 = 40 \pm 24$ years, which simplifies to a range of 16 to 64 years.

At least 96% of all the ages will lie in the range $\bar{X} \pm 5s$. In our case this means that at least 96% of the people will have an age in the range $40 \pm 5 \cdot 6 = 40 \pm 30$ years, which simplifies to a range of 10 to 70 years.

At least 97.2% of all the ages will lie in the range $\bar{X} \pm 6s$. In our case this means that at least 97.2% of the people will have an age in the range $40 \pm 6 \cdot 6 = 40 \pm 36$ years, which simplifies to a range of 4 to 76 years.

How can we calculate these percentages? To calculate the 75%, the 88.9%, the 93.75%, etc., we look at the number of standard deviations in the respective intervals. The 75% goes together with 'mean ± 2 standard deviations', the 88.9% with 'mean ± 3 standard deviations', the 93.75% with 'mean ± 4 standard deviations', and the 96% with 'mean ± 5 standard deviations'. In general, the minimum percentage of people with an age in the range 'mean ± k standard deviations' can be found by calculating the value of the quantity $1 - \frac{1}{k^2}$ and then converting that into a percentage. Summarizing the above we get the following table:
Interval            k    $1 - 1/k^2$    Value     %
$\bar{X} \pm 2s$    2    $1 - 1/2^2$    0.75      75
$\bar{X} \pm 3s$    3    $1 - 1/3^2$    0.889     88.9
$\bar{X} \pm 4s$    4    $1 - 1/4^2$    0.9375    93.75
$\bar{X} \pm 5s$    5    $1 - 1/5^2$    0.96      96
$\bar{X} \pm 6s$    6    $1 - 1/6^2$    0.972     97.2

Do we have to restrict ourselves to whole numbers as values for k? No, we may take any value for k as long as it is larger than 1. For instance, for k = 2.5 we get $1 - \frac{1}{2.5^2} = 0.84$, so at least 84% of the ages lie in the interval $40 \pm 2.5 \cdot 6 = 40 \pm 15$ years, that is, from 25 to 55 years.
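The rule for the percentages can be checked with a few lines of Python. This is a minimal sketch, not part of the original notes; the function names `chebyshev_bound` and `chebyshev_interval` are made up for illustration, and the mean of 40 and standard deviation of 6 are the values from the age example.

```python
def chebyshev_bound(k: float) -> float:
    """Minimum fraction of the data within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be larger than 1")
    return 1 - 1 / k**2


def chebyshev_interval(mean: float, sd: float, k: float) -> tuple:
    """Interval 'mean +/- k standard deviations' as (lower, upper)."""
    return (mean - k * sd, mean + k * sd)


# Age example from the text: mean 40 years, standard deviation 6 years.
for k in (2, 2.5, 3, 4, 5, 6):
    low, high = chebyshev_interval(40, 6, k)
    print(f"k = {k}: at least {chebyshev_bound(k):.2%} of ages between {low} and {high}")
```

Because k is an ordinary floating-point argument, values such as 2.5 need no special handling.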

Example 1:
Students Who Care is a student volunteer program in which college students donate work time in community centers for homeless people. Professor Gill is the faculty sponsor for this student volunteer program. For several years Dr. Gill has kept a record of the total number of work hours volunteered by a student in the program each semester. For students in the program, the mean number of hours each semester was 29.1 hours with a standard deviation of 1.7 hours. Find an interval for the number of hours volunteered in which at least 88.9% of the students in this program would fit.

Solution: From the table above we see that a percentage of 88.9 corresponds to k = 3, so the interval is $29.1 \pm 3 \cdot 1.7 = 29.1 \pm 5.1$ hours. This can be rewritten as an interval from 24.0 to 34.2 hours volunteered each semester.
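Example 1 runs the rule backwards: the percentage is given and k has to be found. Solving $1 - 1/k^2 = p$ for k gives $k = 1/\sqrt{1 - p}$. The sketch below assumes that rearrangement; the function name `chebyshev_k` is hypothetical, and 88.9% is treated as the rounded value of 8/9.

```python
import math


def chebyshev_k(min_fraction: float) -> float:
    """Smallest k for which the bound 1 - 1/k^2 reaches min_fraction."""
    if not 0 < min_fraction < 1:
        raise ValueError("min_fraction must lie strictly between 0 and 1")
    return 1 / math.sqrt(1 - min_fraction)


# Example 1: mean 29.1 hours, standard deviation 1.7 hours, at least 8/9 of students.
mean, sd = 29.1, 1.7
k = chebyshev_k(8 / 9)
print(f"k = {k:.3f}, interval from {mean - k * sd:.1f} to {mean + k * sd:.1f} hours")
```

Using the rounded figure 0.889 instead of 8/9 gives a k very slightly above 3, which is why the table lookup in the solution is the more convenient route by hand.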

Example 2:
The East Coast Independent News periodically runs ads in its own classified section offering a month's free subscription to those who respond. This way management can get a sense of the number of subscribers who read the classified section each day. Careful records have been kept over a period of 2 years. The mean number of responses was 525 with a standard deviation of 30. What is the smallest percentage of responses in the interval between 375 and 675?

Solution: The difference between the mean of 525 and the upper limit of this interval is 150. This is 5 standard deviations, since 150 / 30 = 5. The same is true for the difference between the mean and the lower limit of this interval. According to the table above, k = 5 corresponds to at least 96%.
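The steps of this solution (measure the distance from the mean to each limit in standard deviations, then apply $1 - 1/k^2$) can be sketched as follows. The function name is hypothetical. Taking the smaller of the two distances is an added assumption that keeps the bound valid when the interval is not centred on the mean; for the symmetric interval of Example 2 both distances are equal.

```python
def chebyshev_bound_for_interval(mean: float, sd: float, low: float, high: float) -> float:
    """Chebyshev lower bound on the fraction of data between low and high."""
    k_low = (mean - low) / sd    # distance to the lower limit, in standard deviations
    k_high = (high - mean) / sd  # distance to the upper limit, in standard deviations
    k = min(k_low, k_high)       # the symmetric interval mean +/- k*sd fits inside [low, high]
    if k <= 1:
        return 0.0               # the theorem says nothing useful for k <= 1
    return 1 - 1 / k**2


# Example 2: mean 525 responses, standard deviation 30, interval 375 to 675.
print(chebyshev_bound_for_interval(525, 30, 375, 675))
```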

The Empirical Rule


When the data values seem to have a normal distribution, or approximately so, we can use a much easier rule than Chebyshev's theorem. The "empirical rule" states that in cases where the distribution is normal, the following statements are true:

Approximately 68% of the data values will fall within 1 standard deviation of the mean.
Approximately 95% of the data values will fall within 2 standard deviations of the mean.
Approximately 99.7% of the data values will fall within 3 standard deviations of the mean.

Example 3:
The average salary for graduates entering the actuarial field is $60,000. If the salaries are normally distributed with a standard deviation of $5,000, then what percentage of the graduates will have a salary between $50,000 and $70,000?

Solution: Both $50,000 and $70,000 are $10,000 away from the mean of $60,000. This is two standard deviations away from the mean, so approximately 95% of the graduates will have a salary in this interval.
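The three percentages of the empirical rule are rounded areas under the normal curve, which can be confirmed with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is only a sketch for checking the figures; the helper name `normal_fraction_within` is made up for illustration.

```python
from statistics import NormalDist


def normal_fraction_within(k: float) -> float:
    """Fraction of a normal distribution within k standard deviations of its mean."""
    z = NormalDist()  # standard normal: mean 0, standard deviation 1
    return z.cdf(k) - z.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} standard deviation(s): {normal_fraction_within(k):.4f}")

# Example 3: salaries with mean $60,000 and standard deviation $5,000.
salaries = NormalDist(mu=60_000, sigma=5_000)
share = salaries.cdf(70_000) - salaries.cdf(50_000)
print(f"share between $50,000 and $70,000: {share:.4f}")
```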

Chebyshev's theorem example

Use Chebyshev's theorem to find what percent of the values will fall between 161 and 229 for a data set with a mean of 195 and a standard deviation of 17.

Use the Empirical Rule to find the two values that 95% of the data will fall between for a data set with a mean of 106 and a standard deviation of 19.

a) The interval (161, 229) can be written as (195 - 2*17, 195 + 2*17), which is the same as (Mean - k*SD, Mean + k*SD) with k = 2. According to Chebyshev's theorem, at least 1 - (1/k-squared) of the measurements will fall within (Mean - k*SD, Mean + k*SD). Here 1 - (1/k-squared) = 1 - (1/2^2) = 1 - 0.25 = 0.75. Thus at least 75 percent of the values will fall between 161 and 229 for a data set with a mean of 195 and a standard deviation of 17.

b) According to the Empirical Rule, approximately 95% of the measurements (data) will fall within two standard deviations of the mean. Therefore (Mean - 2*SD, Mean + 2*SD) = (106 - 2*19, 106 + 2*19) = (68, 144) will contain approximately 95% of the observations. Thus the two values are 68 and 144.

Chebyshev's Theorem
A mathematician named Chebyshev came up with bounds on how much of the data must lie close to the mean. In particular for any positive k, the proportion of the data that lies within k standard deviations of the mean is at least

$1 - \frac{1}{k^2}$

For example, if k = 2 this number is

$1 - \frac{1}{2^2} = 0.75$

This tells us that at least 75% of the data lies within 2 standard deviations of the mean. In the above example, we can say that at least 75% of the diners spent between

49.2 - 2(17) = 15.2


and

49.2 + 2(17) = 83.2 dollars.

Decile

The deciles are the nine values of the variable that divide an ordered data set into ten equal parts. The deciles determine the values for 10%, 20%, ... and 90% of the data. D5 coincides with the median.

Calculating Deciles
1. Order the data from smallest to largest.

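2. Find the place that each decile occupies in the cumulative frequency table using the expression $\frac{k \cdot N}{10}$, for k = 1, 2, ..., 9. The class in which the cumulative frequency first reaches this value is the decile class, and the decile is found by interpolating within it:

$D_k = L_i + \frac{\frac{k \cdot N}{10} - F_{i-1}}{f_i} \cdot a_i$

$L_i$ is the lower limit of the decile class. N is the sum of the absolute frequencies. $F_{i-1}$ is the cumulative frequency of the class immediately below the decile class. $f_i$ is the absolute frequency of the decile class. $a_i$ is the width of the decile class. The deciles are independent of the widths of the classes.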

Example
Calculate the deciles of the distribution for the following table:

Class         fi    Fi
[50, 60)      8     8
[60, 70)      10    18
[70, 80)      16    34
[80, 90)      14    48
[90, 100)     10    58
[100, 110)    5     63
[110, 120)    2     65
Total         65

Calculation of the First Decile

Calculation of the Second Decile

Calculation of the Third Decile

Calculation of the Fourth Decile

Calculation of the Fifth Decile

Calculation of the Sixth Decile

Calculation of the Seventh Decile

Calculation of the Eighth Decile

Calculation of the Ninth Decile
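The nine calculations can be sketched in Python. The sketch assumes the usual interpolation formula for grouped data, decile = lower limit + (k·N/10 − cumulative frequency below the class) / (class frequency) × class width, built from the symbols defined above. The function name `grouped_decile` is hypothetical; the classes and frequencies are those of the table in the example.

```python
def grouped_decile(k: int, classes: list, freqs: list) -> float:
    """k-th decile of a grouped frequency distribution, by linear interpolation."""
    n = sum(freqs)
    target = k * n / 10           # position of the decile in the cumulative table
    cumulative = 0                # cumulative frequency below the current class
    for (lower, upper), f in zip(classes, freqs):
        if cumulative + f >= target:
            # This is the decile class: interpolate inside it.
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("k must be between 1 and 9")


classes = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100), (100, 110), (110, 120)]
freqs = [8, 10, 16, 14, 10, 5, 2]

deciles = [grouped_decile(k, classes, freqs) for k in range(1, 10)]
for k, d in enumerate(deciles, start=1):
    print(f"D{k} = {d:.4f}")
```

For instance, the first decile sits at position 65/10 = 6.5, which falls in the class [50, 60), so D1 = 50 + (6.5 − 0)/8 × 10 = 58.125.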

Deciles
The arranged data can be divided into ten equal parts by nine values. These values are called deciles and are denoted by D1, D2, ..., D9. Two different formulas are used for the calculation of deciles: one for grouped data (data presented in a frequency distribution) and one for ungrouped data (data in original form). The calculation of deciles for both grouped and ungrouped data is explained below with the help of simple problems.

Calculation of Deciles for Ungrouped Data


Deciles of ungrouped data can be calculated with the help of the following formula, where n is the number of observations and k = 1, 2, ..., 9:

$D_k$ = value of the $\frac{k(n+1)}{10}$th observation in the ordered data

Problem: Twelve donors donated the following amounts to a charity fund: 500, 850, 925, 800, 600, 750, 650, 625, 800, 400, 725, and 550. Find D4, D7 and D9.

Arrange the data in ascending order: 400, 500, 550, 600, 625, 650, 725, 750, 800, 800, 850, 925

Fourth Decile (D4)

D4 is the value of the $\frac{4(12+1)}{10} = 5.2$th observation. Since the 5.2th observation lies between the 5th and 6th values in the ordered data, that is, between 625 and 650, we interpolate: $D_4 = 625 + 0.2 \cdot (650 - 625) = 630$.

Seventh Decile (D7)


The calculation of seventh decile is given as:

Ninth Decile (D9)


The calculation of D9 is given below:
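The three deciles asked for in the problem can be computed with a short Python sketch. It assumes the position rule k(n + 1)/10 with linear interpolation between neighbouring observations, which is consistent with the "5.2th observation" used for D4 in the text. The function name `decile` is made up for illustration.

```python
def decile(data: list, k: int) -> float:
    """k-th decile of ungrouped data using the k(n + 1)/10 position rule."""
    ordered = sorted(data)
    n = len(ordered)
    position = k * (n + 1) / 10       # 1-based position, may be fractional
    whole = int(position)             # observation just below the position
    fraction = position - whole       # how far to move towards the next observation
    if whole < 1:
        return ordered[0]
    if whole >= n:
        return ordered[-1]
    lower_value = ordered[whole - 1]  # convert the 1-based position to a 0-based index
    upper_value = ordered[whole]
    return lower_value + fraction * (upper_value - lower_value)


donations = [500, 850, 925, 800, 600, 750, 650, 625, 800, 400, 725, 550]
for k in (4, 7, 9):
    print(f"D{k} = {decile(donations, k):.1f}")
```

For D7 the position is 9.1, between the 9th and 10th values, which are both 800. For D9 the position is 11.7, between 850 and 925, giving 850 + 0.7 × 75 = 902.5.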

Calculation of Deciles for Frequency Distribution


In case of frequency distribution, deciles can be calculated by using the formula:

Problem: The daily earnings of employees working at an industrial complex are given below in table. Find D2, D5 and D9.

Solution:

2nd Decile (D2)


In the case of a frequency distribution, the 2nd decile can be calculated by using the formula given below:

5th Decile

The calculation of 5th decile is given below:

9th Decile
The calculation of 9th decile is shown in the figure below:

Chebyshev's Theorem
The proportion of the values that fall within k standard deviations of the mean will be at least $1 - \frac{1}{k^2}$, where k is any number greater than 1. "Within k standard deviations" means the interval from $\bar{X} - ks$ to $\bar{X} + ks$.

Chebyshev's Theorem is true for any sample set, no matter what the distribution.
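The claim that the bound holds for any data set can be illustrated on a deliberately skewed sample. The sketch below draws made-up data from an exponential distribution (an arbitrary choice, as are the seed and the sample size) and compares the observed fraction within k standard deviations with the guaranteed minimum. The population standard deviation `pstdev` is used, for which the bound holds exactly on a finite data set.

```python
import random
from statistics import mean, pstdev


def fraction_within(data: list, k: float) -> float:
    """Observed fraction of the data within k standard deviations of the mean."""
    m = mean(data)
    s = pstdev(data)  # population standard deviation of the data set
    inside = sum(1 for x in data if abs(x - m) <= k * s)
    return inside / len(data)


rng = random.Random(0)                                 # fixed seed so the run is repeatable
sample = [rng.expovariate(1.0) for _ in range(1000)]   # strongly right-skewed data

for k in (1.5, 2, 3):
    observed = fraction_within(sample, k)
    guaranteed = 1 - 1 / k**2
    print(f"k = {k}: observed {observed:.3f}, guaranteed at least {guaranteed:.3f}")
```

The observed fractions are typically well above the guaranteed minimum, which reflects how conservative the theorem is.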

Empirical Rule
The empirical rule is only valid for bell-shaped (normal) distributions. The following statements are true.

Approximately 68% of the data values fall within one standard deviation of the mean.
Approximately 95% of the data values fall within two standard deviations of the mean.
Approximately 99.7% of the data values fall within three standard deviations of the mean.

Chi-square curve. The chi-square curve is a family of curves that depend on a parameter called degrees of freedom (d.f.). The chi-square curve is an approximation to the probability histogram of the chi-square statistic for the multinomial model if the expected number of outcomes in each category is large. The chi-square curve is positive, and its total area is 100%, so we can think of it as the probability histogram of a random variable. The balance point of the curve is d.f., so the expected value of the corresponding random variable would equal d.f. The standard error of the corresponding random variable would be $\sqrt{2 \cdot \text{d.f.}}$. As d.f. grows, the shape of the chi-square curve approaches the shape of the normal curve.

Chi-square Statistic. The chi-square statistic is used to measure the agreement between categorical data and a multinomial model that predicts the relative frequency of outcomes in each possible category. Suppose there are n independent trials, each of which can result in one of k possible outcomes. Suppose that in each trial, the probability that outcome i occurs is $p_i$, for i = 1, 2, ..., k, and that these probabilities are the same in every trial. The expected number of times outcome 1 occurs in the n trials is $n p_1$; more generally, the expected number of times outcome i occurs is $\text{expected}_i = n p_i$. If the model is correct, we would expect the n trials to result in outcome i about $n p_i$ times, give or take a bit. Let $\text{observed}_i$ denote the number of times an outcome of type i occurs in the n trials, for i = 1, 2, ..., k.

The chi-squared statistic summarizes the discrepancies between the expected number of times each outcome occurs (assuming that the model is true) and the observed number of times each outcome occurs, by summing the squares of the discrepancies, normalized by the expected numbers, over all the categories:

$\text{chi-squared} = \frac{(\text{observed}_1 - \text{expected}_1)^2}{\text{expected}_1} + \frac{(\text{observed}_2 - \text{expected}_2)^2}{\text{expected}_2} + \dots + \frac{(\text{observed}_k - \text{expected}_k)^2}{\text{expected}_k}$

As the sample size n increases, if the model is correct, the sampling distribution of the chi-squared statistic is approximated increasingly well by the chi-squared curve with (#categories − 1) = k − 1 degrees of freedom.
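The sum that defines the chi-squared statistic translates directly into code. In this minimal sketch the function name `chi_squared` and the die-roll counts are hypothetical, invented only to show the arithmetic; the model assumed is a fair six-sided die, so every category has probability 1/6.

```python
def chi_squared(observed: list, probabilities: list) -> float:
    """Chi-squared statistic for observed counts against a multinomial model."""
    n = sum(observed)                            # number of trials
    expected = [n * p for p in probabilities]    # expected count for each category
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Hypothetical data: 60 rolls of a die, model of a fair die.
observed = [8, 12, 9, 11, 6, 14]
probabilities = [1 / 6] * 6

statistic = chi_squared(observed, probabilities)
degrees_of_freedom = len(observed) - 1           # number of categories minus 1
print(f"chi-squared = {statistic:.2f} with {degrees_of_freedom} degrees of freedom")
```

Each expected count here is 10, so the statistic is (4 + 4 + 1 + 1 + 16 + 16)/10 = 4.2.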
