Definition
The weighted mean is an average computed by giving different weights to some of
the individual values. If all the weights are equal, the weighted mean is the
same as the arithmetic mean. Although weighted means generally behave like
arithmetic means, they have a few counterintuitive properties. Data elements
with a high weight contribute more to the weighted mean than elements with a
low weight. The weights cannot be negative. Some may be zero, but not all of
them, since division by zero is not allowed. Weighted means play an important
role in data analysis and in weighted differential and integral calculus.
x̄ = Σwx / Σw
where
x is the repeating value
w is the number of occurrences of x (weight)
x̄ is the weighted mean
Probability and statistics provide a collection of tools: methods and procedures
for gathering, organizing, and analyzing data, together with a set of ideas for
drawing scientific inferences from the resulting summarized data. In many
applications it is necessary to calculate the weighted mean of a set of data
whose values have different individual errors.
You probably have a good intuitive grasp of what the average of a data set says about that data set. In
this section we begin to learn what the standard deviation has to tell us about the nature of the data
set.
If we go through the data and count the number of observations that are within one standard deviation
of the mean, that is, that are between 69.92 − 1.70 = 68.22 and 69.92 + 1.70 = 71.62 inches,
there are 69 of them. If we count the number of observations that are within two standard deviations
of the mean, that is, that are between 69.92 − 2(1.70) = 66.52 and 69.92 + 2(1.70) = 73.32 inches,
there are 95 of them. All of the measurements are within three standard deviations of the
mean, that is, between 69.92 − 3(1.70) = 64.82 and 69.92 + 3(1.70) = 75.02 inches.
These tallies are not coincidences, but are in agreement with the following result that has
been found to be widely applicable.
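Figure 2.5.1: Heights of Adult Men
THE EMPIRICAL RULE
If a data set has an approximately bell-shaped relative frequency histogram, then
approximately 68% of the data lie within one standard deviation of the mean, that is, in
the interval with endpoints x̄ ± s for samples and with endpoints μ ± σ for
populations;
approximately 95% of the data lie within two standard deviations of the mean, that is, in
the interval with endpoints x̄ ± 2s for samples and with endpoints μ ± 2σ for
populations; and
approximately 99.7% of the data lie within three standard deviations of the mean, that
is, in the interval with endpoints x̄ ± 3s for samples and with endpoints μ ± 3σ for
populations.
Figure 2.5.2: The Empirical Rule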
Two key points in regard to the Empirical Rule are that the data distribution must be
approximately bell-shaped and that the percentages are only approximately true. The Empirical Rule
does not apply to data sets with severely asymmetric distributions, and the actual percentage of
observations in any of the intervals specified by the rule could be either greater or less than those
given in the rule. We see this with the example of the heights of the men: the Empirical Rule suggested
68 observations between 68.22 and 71.62 inches, but we counted 69.
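The interval endpoints quoted for the heights data can be reproduced with a few lines of Python. This is a minimal sketch, not part of the original text: the function name empirical_rule_intervals is hypothetical, and the shares 68%, 95%, and 99.7% are the approximate values stated by the Empirical Rule, so the output is only meaningful for roughly bell-shaped data.

```python
def empirical_rule_intervals(mean, sd):
    """Return {k: (low, high, approx_share)} for k = 1, 2, 3 standard deviations.

    The shares are the approximate Empirical Rule values; they assume a
    roughly bell-shaped distribution.
    """
    shares = {1: 0.68, 2: 0.95, 3: 0.997}
    return {
        k: (round(mean - k * sd, 2), round(mean + k * sd, 2), share)
        for k, share in shares.items()
    }


# Mean and standard deviation of the heights data discussed above.
intervals = empirical_rule_intervals(69.92, 1.70)
for k, (low, high, share) in intervals.items():
    print(f"within {k} sd: {low} to {high} inches, about {share:.1%} of the data")
```

Comparing these approximate shares with the actual tallies (69 and 95 observations in the first two intervals) shows how closely a given data set follows the rule.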
EXAMPLE 2.5.1
Heights of 18-year-old males have a bell-shaped distribution with mean 69.6 inches and
standard deviation 1.4 inches.
1. About what proportion of all such men are between 68.2 and 71 inches tall?
2. What interval centered on the mean should contain about 95% of all such men?
Solution:
1. Since the interval from 68.2 to 71.0 has endpoints x̄ − s and x̄ + s, by the
Empirical Rule about 68% of all 18-year-old males should have heights in this range.
2. By the Empirical Rule the shortest such interval has
endpoints x̄ − 2s and x̄ + 2s. Since
x̄ − 2s = 69.6 − 2(1.4) = 66.8 (2.5.1)
and
x̄ + 2s = 69.6 + 2(1.4) = 72.4 (2.5.2)
the interval in question is the interval from 66.8 inches to 72.4 inches.
Figure 2.5.3: Distribution of Heights
EXAMPLE 2.5.2
Scores on IQ tests have a bell-shaped distribution with mean μ = 100 and standard
deviation σ = 10. Discuss what the Empirical Rule implies concerning individuals with IQ scores
of 110, 120, and 130.
Solution:
A sketch of the IQ distribution is given in Figure 2.5.3. The Empirical Rule states that
1. approximately 68% of the IQ scores in the population lie between 90 and 110,
2. approximately 95% of the IQ scores in the population lie between 80 and 120,
and
3. approximately 99.7% of the IQ scores in the population lie between 70 and 130.
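One way to read these three statements for the individual scores 110, 120, and 130 is to look at the upper tails. The sketch below is an illustration rather than part of the original solution: it assumes the bell-shaped distribution is symmetric about the mean, so that the share falling outside each interval splits evenly between the two tails. The function name share_above is hypothetical.

```python
def share_above(k):
    """Approximate share of a symmetric, bell-shaped population lying more
    than k standard deviations above the mean (k = 1, 2, or 3)."""
    within = {1: 0.68, 2: 0.95, 3: 0.997}[k]
    # Whatever lies outside the interval is split evenly between the tails.
    return (1 - within) / 2


mu, sigma = 100, 10  # values given in the example
for k in (1, 2, 3):
    score = mu + k * sigma
    print(f"IQ above {score}: roughly {share_above(k):.2%} of the population")
```

Under these assumptions roughly 16% of scores exceed 110, about 2.5% exceed 120, and about 0.15% exceed 130.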
Chebyshev’s Theorem
The Empirical Rule does not apply to all data sets, only to those that are bell-shaped, and even then is
stated in terms of approximations. A result that applies to every data set is known as Chebyshev’s
Theorem.
CHEBYSHEV’S THEOREM
For any numerical data set,
at least 3/4 of the data lie within two standard deviations of the mean, that is, in the interval
with endpoints x̄ ± 2s for samples and with endpoints μ ± 2σ for populations;
at least 8/9 of the data lie within three standard deviations of the mean, that is, in the
interval with endpoints x̄ ± 3s for samples and with endpoints μ ± 3σ for
populations;
at least 1 − 1/k² of the data lie within k standard deviations of the mean, that is, in the
interval with endpoints x̄ ± ks for samples and with endpoints μ ± kσ for
populations, where k is any positive whole number that is greater than 1.
It is important to pay careful attention to the words “at least” at the beginning of each of the three
parts of Chebyshev’s Theorem. The theorem gives the minimum proportion of the data which must lie
within a given number of standard deviations of the mean; the true proportions found within the
indicated regions could be greater than what the theorem guarantees.
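The general bound 1 − 1/k² is easy to tabulate. The following sketch uses exact fractions so the values match the 3/4 and 8/9 stated in the theorem; the function name chebyshev_bound is hypothetical, and k is restricted to whole numbers greater than 1, as in the statement above.

```python
from fractions import Fraction


def chebyshev_bound(k):
    """Minimum share of any data set within k standard deviations of the mean."""
    if not isinstance(k, int) or k <= 1:
        raise ValueError("k must be a whole number greater than 1")
    return 1 - Fraction(1, k * k)


for k in (2, 3, 4):
    print(f"k = {k}: at least {chebyshev_bound(k)} of the data")
```

For k = 2 and k = 3 this reproduces the first two parts of the theorem; larger k gives guarantees that approach, but never reach, 100%.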
EXAMPLE 2.5.3
A sample of size n = 50 has mean x̄ = 28 and standard deviation s = 3. Without knowing
anything else about the sample, what can be said about the number of observations that lie in the
interval (22, 34)? What can be said about the number of observations that lie outside that
interval?
Solution:
The interval (22, 34) is the one that is formed by adding and subtracting two standard
deviations from the mean. By Chebyshev’s Theorem, at least 3/4 of the data are within this
interval. Since 3/4 of 50 is 37.5, this means that at least 37.5 observations are in the
interval. But one cannot take a fractional observation, so we conclude that at least 38 observations
must lie inside the interval (22, 34).
If at least 3/4 of the observations are in the interval, then at most 1/4 of them are outside it.
Since 1/4 of 50 is 12.5, at most 12.5 observations are outside the interval. Since again
a fraction of an observation is impossible, at most 12 observations lie outside the interval (22, 34).
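The rounding argument in this solution (round the minimum count up, round the maximum count down) can be sketched as follows. The function name chebyshev_counts is hypothetical; exact fractions are used so that the rounding is not affected by floating-point error.

```python
import math
from fractions import Fraction


def chebyshev_counts(n, k):
    """Return (min_inside, max_outside) for n observations and the interval
    within k standard deviations of the mean, using Chebyshev's Theorem."""
    inside_share = 1 - Fraction(1, k * k)
    # A count must be a whole number, so the guaranteed minimum rounds up.
    min_inside = math.ceil(n * inside_share)
    # Whatever is not guaranteed to be inside may lie outside.
    max_outside = n - min_inside
    return min_inside, max_outside


min_inside, max_outside = chebyshev_counts(50, 2)
print(min_inside, max_outside)  # 38 12
```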
EXAMPLE 2.5.4
The number of vehicles passing through a busy intersection
between 8:00 a.m. and 10:00 a.m. was observed and recorded on every weekday
morning of the last year. The data set contains n = 251 numbers. The sample mean
is x̄ = 725 and the sample standard deviation is s = 25. Identify which of the following
statements must be true.
1. On approximately 95% of the weekday mornings last year the number of vehicles passing
through the intersection from 8:00 a.m. to 10:00 a.m. was
between 675 and 775.
2. On at least 75% of the weekday mornings last year the number of vehicles passing
through the intersection from 8:00 a.m. to 10:00 a.m. was
between 675 and 775.
3. On at least 189 weekday mornings last year the number of vehicles passing through the
intersection from 8:00 a.m. to 10:00 a.m. was
between 675 and 775.
4. On at most 25% of the weekday mornings last year the number of vehicles passing
through the intersection from 8:00 a.m. to 10:00 a.m. was either less
than 675 or greater than 775.
5. On at most 12.5% of the weekday mornings last year the number of vehicles passing
through the intersection from 8:00 a.m. to 10:00 a.m. was less than 675.
6. On at most 25% of the weekday mornings last year the number of vehicles passing
through the intersection from 8:00 a.m. to 10:00 a.m. was less than 675.
Solution:
1. Since it is not stated that the relative frequency histogram of the data is bell-shaped, the
Empirical Rule does not apply. Statement (1) is based on the Empirical Rule and therefore it
might not be correct.
2. Statement (2) is a direct application of part (1) of Chebyshev’s Theorem
because (x̄ − 2s, x̄ + 2s) = (675, 775). It must be correct.
3. Statement (3) says the same thing as statement (2)
because 75% of 251 is 188.25, so the minimum whole number of observations
in this interval is 189. Thus statement (3) is definitely correct.
4. Statement (4) says the same thing as statement (2) but in different words, and therefore is
definitely correct.
5. Statement (4), which is definitely correct, states that at most 25% of the time either fewer
than 675 or more than 775 vehicles passed through the intersection. Statement (5) says
that half of that 25% corresponds to days of light traffic. This would be correct if the
relative frequency histogram of the data were known to be symmetric. But this is not stated;
perhaps all of the observations outside the interval (675, 775) are less than 675. Thus
statement (5) might not be correct.
6. Statement (4) is definitely correct and statement (4) implies statement (6): even if every
measurement that is outside the interval (675, 775) is less than 675 (which is
conceivable, since symmetry is not known to hold), at most 25% of all
observations are less than 675. Thus statement (6) must definitely be correct.
KEY TAKEAWAY
The Empirical Rule is an approximation that applies only to data sets with a bell-shaped
relative frequency histogram. It estimates the proportion of the measurements that lie within
one, two, and three standard deviations of the mean.
Chebyshev’s Theorem is a fact that applies to all possible data sets. It describes the minimum
proportion of the measurements that must lie within two, three, or more standard deviations of
the mean.
Contributor
Anonymous
1. The LibreTexts libraries are Powered by MindTouch® and are based upon work supported
by the National Science Foundation under grant numbers: 1246120, 1525057, and
1413739. Unless otherwise noted, the LibreTexts library is licensed under a Creative
Commons Attribution-Noncommercial-Share Alike 3.0 United States License. Permissions
beyond the scope of this license may be available at copyright@ucdavis.edu.
When you take an exam, what is often as important as your actual score on the exam is the way your
score compares to other students’ performance. If you made a 70 but the average score (whether
the mean, median, or mode) was 85, you did relatively poorly. If you made a 70 but the average
score was only 55 then you did relatively well. In general, the significance of one observed value
in a data set strongly depends on how that value compares to the other observed values in a data set.
Therefore we wish to attach to each observed value a number that measures its relative position.
EXAMPLE 2.4.1
What percentile is the value 1.39 in the data set of ten GPAs considered in a previous Example?
What percentile is the value 3.33?
Solution:
1.39 1.76 1.90 2.12 2.53 2.71 3.00 3.33 3.71 4.00 (2.4.1)
The only data value that is less than or equal to 1.39 is 1.39 itself. Since 1 out of ten,
or 1/10 = 10%, of the data points are less than or equal to 1.39, 1.39 is
the 10th percentile. Eight data values are less than or equal to 3.33. Since 8 out of ten,
or 8/10 = 0.80 = 80%, of the data values are less than or equal to 3.33, the
value 3.33 is the 80th percentile of the data.
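The counting used in this solution (the share of data values less than or equal to the given value) can be sketched in a few lines of Python. The function name percentile_rank is hypothetical; other textbooks and software define percentiles slightly differently, so this follows only the convention used in the example.

```python
def percentile_rank(data, value):
    """Percentage of the data values that are less than or equal to value."""
    at_or_below = sum(1 for x in data if x <= value)
    return 100 * at_or_below / len(data)


gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
print(percentile_rank(gpas, 1.39))  # 10.0
print(percentile_rank(gpas, 3.33))  # 80.0
```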
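The Pth percentile cuts the data set in two so that approximately P% of the data lie below it
and (100 − P)% of the data lie above it. In particular, the three percentiles that cut the data
into fourths, as shown in Figure 2.4.1, are called the quartiles of a data set. The quartiles are the
three numbers Q1, Q2, Q3 that divide the data set approximately into fourths. The following
simple computational definition of the three quartiles works well in practice.
DEFINITION: QUARTILE
For any data set:
1. The second quartile Q2 of the data set is its median.
2. Define two subsets: the lower set L, all observations that are strictly less than Q2, and
the upper set U, all observations that are strictly greater than Q2.
3. The first quartile Q1 of the data set is the median of L.
4. The third quartile Q3 of the data set is the median of U.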
EXAMPLE 2.4.2
Find the quartiles of the data set of GPAs discussed in a previous Example.
Solution:
1.39 1.76 1.90 2.12 2.53 2.71 3.00 3.33 3.71 4.00 (2.4.2)
This data set has n = 10 observations. Since 10 is an even number, the median is the mean of
the two middle observations: x̃ = (2.53 + 2.71)/2 = 2.62. Thus the second
quartile is Q2 = 2.62. The lower and upper subsets are
Lower: L = {1.39, 1.76, 1.90, 2.12, 2.53}
Upper: U = {2.71, 3.00, 3.33, 3.71, 4.00}.
Each has an odd number of elements, so the median of each is its middle observation. Thus the first
quartile is Q1 = 1.90, the median of L, and the third quartile is Q3 = 3.33, the median
of U.
EXAMPLE 2.4.3
Adjoin the observation 3.88 to the data set of the previous example and find the quartiles of the
new set of data.
Solution:
1.39 1.76 1.90 2.12 2.53 2.71 3.00 3.33 3.71 3.88 4.00 (2.4.3)
This data set has 11 observations. The second quartile is its median, the middle value 2.71.
Thus Q2 = 2.71. The lower and upper subsets are now
Lower: L = {1.39, 1.76, 1.90, 2.12, 2.53}
Upper: U = {3.00, 3.33, 3.71, 3.88, 4.00}.
The lower set L has median, the middle value, 1.90, so Q1 = 1.90. The upper set has
median 3.71, so Q3 = 3.71.
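The procedure followed in the last two examples can be sketched in Python. This is an illustration under a stated assumption: the lower and upper sets are taken to be the observations strictly below and strictly above the median, which is what both worked examples show. The function name quartiles is hypothetical, and statistical software often uses other quartile conventions that give slightly different answers.

```python
from statistics import median


def quartiles(data):
    """Return (Q1, Q2, Q3) using the median-of-halves method from the examples.

    Assumes at least one observation lies strictly below and one strictly
    above the median; otherwise statistics.median raises StatisticsError.
    """
    q2 = median(data)
    lower = [x for x in data if x < q2]  # lower set L
    upper = [x for x in data if x > q2]  # upper set U
    return median(lower), q2, median(upper)


gpas = [1.39, 1.76, 1.90, 2.12, 2.53, 2.71, 3.00, 3.33, 3.71, 4.00]
print(quartiles(gpas))           # ten observations
print(quartiles(gpas + [3.88]))  # eleven observations
```

With the ten GPAs the three quartiles come out as 1.90, 2.62, and 3.33 (up to floating-point rounding in the middle value); adjoining 3.88 moves them to 1.90, 2.71, and 3.71.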
In addition to the three quartiles, the two extreme values, the minimum xmin and the
maximum xmax, are also useful in describing the entire data set. Together these five numbers are
called the five-number summary of a data set,
{xmin, Q1, Q2, Q3, xmax}
The five-number summary is used to construct a box plot, as in Figure 2.4.2. Each of the five
numbers is represented by a vertical line segment, a box is formed using the line segments
at Q1 and Q3 as its two vertical sides, and two horizontal line segments are extended from the
vertical segments marking Q1 and Q3 to the adjacent extreme values. (The two horizontal line
segments are referred to as “whiskers,” and the diagram is sometimes called a “box and whisker plot.”)
We caution the reader that there are other types of box plots that differ somewhat from the ones we
are constructing, although all are based on the three quartiles.
Note that the distance from Q1 to Q3 is the length of the interval over which the middle half of
the data range. Thus it has the following special name.
DEFINITION: INTERQUARTILE RANGE
The interquartile range IQR is the quantity
IQR = Q3 − Q1 (2.4.4)
EXAMPLE 2.4.4
Construct a box plot and find the IQR for the data in Example 2.4.2.
Solution:
From our work in Example 2.4.2, we know that the five-number summary is
xmin = 1.39
Q1 = 1.90
Q2 = 2.62
Q3 = 3.33
xmax = 4.00
The interquartile range is IQR = Q3 − Q1 = 3.33 − 1.90 = 1.43.
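As a small illustration, the five-number summary can be held in one record from which the interquartile range follows directly. The class name FiveNumberSummary and its field names are hypothetical choices for this sketch; the numbers are the ones listed in the solution.

```python
from typing import NamedTuple


class FiveNumberSummary(NamedTuple):
    """The five numbers that describe a data set and define its box plot."""
    x_min: float
    q1: float
    q2: float
    q3: float
    x_max: float

    @property
    def iqr(self):
        """Interquartile range: the length of the box, Q3 - Q1."""
        return self.q3 - self.q1


summary = FiveNumberSummary(x_min=1.39, q1=1.90, q2=2.62, q3=3.33, x_max=4.00)
print(round(summary.iqr, 2))  # 1.43
```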
z-Scores
Another way to locate a particular observation x in a data set is to compute its distance from the
mean in units of standard deviation. The z-score indicates how many standard deviations an
individual observation x is from the center of the data set, its mean. It is used on distributions that
have been standardized, which allows us to better understand their properties. If z is negative then x is
below average. If z is 0 then x is equal to the average. If z is positive then x is above the average.
DEFINITION: z-SCORE
The z-score of an observation x is the number z given by the computational formula
z = (x − μ)/σ (2.4.5)
Figure 2.4.3: x-Scale versus z-Score
EXAMPLE 2.4.5
Suppose the mean and standard deviation of the GPAs of all currently registered students at a college
are μ = 2.70 and σ = 0.50. The z-scores of the GPAs of two students, Antonio and
Beatrice, are z = −0.62 and z = 1.28, respectively. What are their GPAs?
Solution:
Solving the z-score formula for x gives x = μ + zσ, so we compute the GPAs as
Antonio: x = μ + zσ = 2.70 + (−0.62)(0.50) = 2.39
Beatrice: x = μ + zσ = 2.70 + (1.28)(0.50) = 3.34
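The z-score formula and its inverse translate directly into code. The sketch below uses hypothetical function names (z_score, from_z_score) and the population values from the example.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    return (x - mu) / sigma


def from_z_score(z, mu, sigma):
    """Recover the observation x from its z-score: x = mu + z * sigma."""
    return mu + z * sigma


mu, sigma = 2.70, 0.50
antonio = from_z_score(-0.62, mu, sigma)
beatrice = from_z_score(1.28, mu, sigma)
print(round(antonio, 2), round(beatrice, 2))  # 2.39 3.34
```

Applying z_score to the recovered GPAs returns the original z-scores, up to floating-point rounding, which is a quick check that the two formulas are inverses of each other.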
KEY TAKEAWAYS
The percentile rank and z-score of a measurement indicate its relative position with regard
to the other measurements in a data set.
The three quartiles divide a data set into fourths.
The five-number summary and its associated box plot summarize the location and distribution
of the data.
Contributor
Anonymous
In simple terms, the formula for the weighted mean can be written as:
Weighted mean = Σwx/Σw
Σ = the sum of (in other words, add them up).
w = the weights.
x = the values.
To use the formula:
1. Multiply each value by its weight.
2. Add the products to get Σwx.
3. Add the weights to get Σw.
4. Divide Σwx by Σw.
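A minimal Python sketch of the formula Σwx/Σw follows. The function name weighted_mean and the sample data are illustrative assumptions: the value 80 is given weight 2 and the value 85 weight 1, treating each weight as the number of times the value occurs. The checks mirror the rules that weights cannot be negative and cannot all be zero.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum of w * x divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights cannot be negative")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("at least one weight must be positive")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical data: the value 80 occurs twice and the value 85 once.
result = weighted_mean([80, 85], [2, 1])
print(result)
```

Because the weights here are occurrence counts, the result equals the ordinary arithmetic mean of 80, 80, and 85; with equal weights the weighted mean always reduces to the arithmetic mean.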
The initial scores per the question statement were 80, 80 and 85.
“Sample problem: You take three 100-point exams in your statistics class and score 80, 80 and 85. ”
Parameters
Population mean = μ = ( Σ Xi ) / N
Population variance = σ² = Σ ( Xi − μ )² / N
Standardized score = Z = (X − μ) / σ
Statistics
Sample mean = x̄ = ( Σ xi ) / n
Sample variance = s² = Σ ( xi − x̄ )² / ( n − 1 )
Pooled sample standard deviation = sp = sqrt{ [ (n1 − 1) * s1² + (n2 − 1) * s2² ] / (n1 + n2 − 2) }
Standard error of regression slope = sb1 = sqrt[ Σ(yi − ŷi)² / (n − 2) ] / sqrt[ Σ(xi − x̄)² ]
Counting
Probability
Random Variables
In the following formulas, X and Y are random variables, and a and b are constants.
Chi-square statistic = Χ² = [ ( n − 1 ) * s² ] / σ²
Variance of the difference between independent random variables = Var(X - Y) = Var(X) + Var(Y)
Sampling Distributions
Standard deviation of difference of sample means = σd = sqrt[ (σ1² / n1) + (σ2² / n2) ]
Standard deviation of difference of sample proportions = σd = sqrt{ [P1(1 − P1) / n1] + [P2(1 − P2) / n2] }
Standard Error
Standard error of difference of sample means = SEd = sd = sqrt[ (s1² / n1) + (s2² / n2) ]
Standard error of difference of paired sample means = SEd = sd = sqrt[ Σ(di − d̄)² / (n − 1) ] / sqrt(n)
Pooled sample standard error = spooled = sqrt{ [ (n1 − 1) * s1² + (n2 − 1) * s2² ] / (n1 + n2 − 2) }
Standard error of difference of sample proportions = sd = sqrt{ [p1(1 − p1) / n1] + [p2(1 − p2) / n2] }
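Two of the formulas above, the pooled sample standard deviation and the standard error of the difference of sample means, can be sketched in Python as follows. The function names and the sample sizes and standard deviations in the usage lines are hypothetical; only the formulas come from the list.

```python
import math


def pooled_sd(n1, s1, n2, s2):
    """Pooled sample standard deviation of two samples."""
    return math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))


def se_diff_means(n1, s1, n2, s2):
    """Standard error of the difference of two independent sample means."""
    return math.sqrt(s1**2 / n1 + s2**2 / n2)


# Hypothetical samples: sizes 5 and 9, standard deviations 3.0 and 4.0.
print(pooled_sd(5, 3.0, 9, 4.0))
print(se_diff_means(5, 3.0, 9, 4.0))
```

When both samples have the same standard deviation, the pooled value equals that common standard deviation, which gives a quick sanity check.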
Linear Transformations
For the following formulas, assume that Y is a linear transformation of the random variable X, defined by the
equation: Y = aX + b.
Estimation
Hypothesis Testing
Degrees of Freedom
The correct formula for degrees of freedom (DF) depends on the situation (the nature of the test statistic, the
number of samples, underlying assumptions, etc.).
One-sample t-test: DF = n - 1
Two-sample t-test: DF = (s1²/n1 + s2²/n2)² / { [ (s1² / n1)² / (n1 − 1) ] + [ (s2² / n2)² / (n2 − 1) ] }
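The two-sample formula above is the one commonly known as the Welch-Satterthwaite approximation, and it is easier to read when split into steps. In the sketch below the function name welch_df is hypothetical and the inputs in the usage line are made-up values for illustration.

```python
def welch_df(n1, s1, n2, s2):
    """Approximate degrees of freedom for a two-sample t-test
    with sample sizes n1, n2 and sample standard deviations s1, s2."""
    v1 = s1**2 / n1  # estimated variance of the first sample mean
    v2 = s2**2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


# Hypothetical samples with equal sizes and equal standard deviations.
print(welch_df(10, 2.0, 10, 2.0))
```

For equal sample sizes and equal standard deviations the result is 2(n − 1), here 18 up to floating-point rounding; unequal variances pull the value down, and it is generally not a whole number.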
Sample Size
Below, the first two formulas find the smallest sample sizes required to achieve a fixed margin of error, using
simple random sampling. The third formula assigns sample to strata, based on a proportionate design. The
fourth formula, Neyman allocation, uses stratified sampling to minimize variance, given a fixed sample size.
And the last formula, optimum allocation, uses stratified sampling to minimize variance, given a fixed budget.
Statistics Tutorial
The term population mean, which is the average score of the population on a given variable, is
represented by:
μ = ( Σ Xi ) / N
The symbol ‘μ’ represents the population mean. The symbol ‘Σ Xi’ represents the sum of all scores
present in the population (say, in this case) X1 X2 X3 and so on. The symbol ‘N’ represents the total
number of individuals or cases in the population.
Population Standard Deviation
The population standard deviation is a measure of the spread (variability) of the scores on a given
variable and is represented by:
σ = sqrt[ Σ ( Xi − μ )² / N ]
The symbol ‘σ’ represents the population standard deviation. The term ‘sqrt’ used in this statistical
formula denotes square root. The term ‘Σ ( Xi − μ )²’ used in the statistical formula represents the sum of
the squared deviations of the scores from their population mean.
Population Variance
The population variance is the square of the population standard deviation and is represented by:
σ² = Σ ( Xi − μ )² / N
The symbol ‘σ²’ represents the population variance.
Sample Mean
The sample mean is the average score of a sample on a given variable and is represented by:
x_bar = ( Σ xi ) / n
The term “x_bar” represents the sample mean. The symbol ‘Σ xi’ used in this formula represents the
sum of all scores present in the sample (say, in this case) x1, x2, x3 and so on. The symbol
‘n’ represents the total number of individuals or observations in the sample.
Sample Standard Deviation
The sample standard deviation is a measure of the spread (variability) of the scores in the
sample on a given variable and is represented by:
s = sqrt[ Σ ( xi − x_bar )² / ( n − 1 ) ]
The term ‘Σ ( xi − x_bar )²’ represents the sum of the squared deviations of the scores from the sample
mean.
Sample Variance
The sample variance is the square of the sample standard deviation and is represented by:
s² = Σ ( xi − x_bar )² / ( n − 1 )
The symbol ‘s²’ represents the sample variance.
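The only difference between the population and sample formulas is the divisor, N versus n − 1. The following sketch makes this concrete; the function names and the eight scores are hypothetical and chosen only so that the arithmetic is easy to follow.

```python
import math


def population_variance(scores):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(scores) / len(scores)
    return sum((x - mu) ** 2 for x in scores) / len(scores)


def sample_variance(scores):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(scores) / len(scores)
    return sum((x - x_bar) ** 2 for x in scores) / (len(scores) - 1)


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical data; mean is 5
print(population_variance(scores), math.sqrt(population_variance(scores)))
print(sample_variance(scores), math.sqrt(sample_variance(scores)))
```

For these scores the sum of squared deviations is 32, so the population variance is 32/8 = 4 (standard deviation 2) while the sample variance is 32/7, slightly larger. Python's standard statistics module offers pvariance, variance, pstdev, and stdev for the same quantities.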
Pooled Sample Standard Deviation
The pooled sample standard deviation is a weighted estimate of spread (variability) across multiple
samples. It is represented by:
sp = sqrt{ [ (n1 − 1) * s1² + (n2 − 1) * s2² ] / (n1 + n2 − 2) }