Frequency Distribution
97, 92, 88, 75, 83, 67, 89, 55, 72, 78, 81, 91, 57, 63, 67, 74, 87, 84, 98, 46
Class Frequency ( f )
90-99 4
80-89 6
70-79 4
60-69 3
50-59 2
40-49 1
Note that the sum of the frequency column is equal to 20, the number of test
scores.
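The tally above can be reproduced with a short Python sketch. The variable names are illustrative, and the code assumes whole-number scores and classes of width 10 that start at multiples of 10, as in the table.

```python
from collections import Counter

scores = [97, 92, 88, 75, 83, 67, 89, 55, 72, 78,
          81, 91, 57, 63, 67, 74, 87, 84, 98, 46]

class_width = 10  # each class spans 10 whole-number scores, e.g. 40-49

# Map every score to the lower limit of its class (46 -> 40, 97 -> 90).
counts = Counter((score // class_width) * class_width for score in scores)

# Build (lower limit, upper limit, frequency) rows, highest class first.
frequency_table = [
    (lower, lower + class_width - 1, counts[lower])
    for lower in sorted(counts, reverse=True)
]

for lower, upper, f in frequency_table:
    print(f"{lower}-{upper}  {f}")

# The frequencies should add up to the number of test scores.
total = sum(f for _, _, f in frequency_table)
print("Sum of frequencies:", total)
```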
Additional Terminology:
Lower Class Limit – The least value that can belong to a class.
The range is:
The width of each class is:
Construct a frequency table (in ascending order) with 6 data classes from
the following data set. (Leave space, we will be adding to this.)
Mathematical Notation:
In this course, the following symbols and variables will have the meanings given
below (unless otherwise specified.)
Variables
x = data value
n = number of values in a sample data set
N = number of values in a population data set
f = frequency of a data class
Symbol
$\Sigma$ = the sum of all values for the following variable or
expression.
Ex: Using our notation, we can write the statement that the sum of the
frequencies in a frequency table should equal the number of values in the
sample data set as follows:
$\sum f = n$
Cumulative Frequency:
Ex:
Notice that the last entry in the cumulative frequency column is n = 20.
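As a minimal sketch, the cumulative frequency column for the test-score table can be computed as a running total of the class frequencies, listed here in ascending class order. The variable names are illustrative.

```python
from itertools import accumulate

# Test-score frequency table in ascending class order.
classes = ["40-49", "50-59", "60-69", "70-79", "80-89", "90-99"]
frequencies = [1, 2, 3, 4, 6, 4]

# Each cumulative frequency is the sum of this class's frequency
# and the frequencies of all lower classes.
cumulative = list(accumulate(frequencies))

for data_class, f, cf in zip(classes, frequencies, cumulative):
    print(f"{data_class}  f={f}  cumulative={cf}")
```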
Relative Frequency:
relative frequency = $\frac{f}{n}$
Ex:
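A short sketch of the relative frequency calculation for the test-score table, dividing each class frequency f by n = 20. The variable names are illustrative.

```python
# Test-score frequency table in ascending class order.
classes = ["40-49", "50-59", "60-69", "70-79", "80-89", "90-99"]
frequencies = [1, 2, 3, 4, 6, 4]

n = sum(frequencies)  # number of values in the sample data set

# relative frequency = f / n for each class
relative = [f / n for f in frequencies]

for data_class, f, rf in zip(classes, frequencies, relative):
    print(f"{data_class}  f={f}  relative frequency={rf:.2f}")

# The relative frequencies of all classes together account for the whole data set.
print("Sum of relative frequencies:", round(sum(relative), 10))
```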
DESCRIBING DATA SETS:
Stem and Leaf Plots: maintain the exact data (section 2.2)
Histograms:
The histogram should have the variable being measured in the data set as its
horizontal axis, and the class frequency as the vertical axis. Each data class
will be represented by a vertical bar whose height is the frequency of the class
and whose width is the class width.
Example: Created in Excel from the data used in the previous examples.
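[Figure: histogram of the test-score frequency table. Vertical axis: frequency, 0 to 7. Horizontal axis: class midpoints 44.5, 54.5, 64.5, 74.5, 84.5, 94.5.]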
Notice that the bar for each class is centered at the class midpoint, and the bars
for successive classes touch.
Class Exercise:
Construct a histogram for the frequency table of gasoline purchases.
Frequency Polygon:
Like a histogram, the vertical axis represents frequency and the horizontal axis
represents the variable being measured in the data set. To construct the graph, a
point is plotted for each class at its midpoint and with height given by the
frequency of the class. The points are then connected by straight lines.
Ex: Created in Excel using the same data as in the previous examples.
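[Figure: frequency polygon of the test-score frequency table. Vertical axis: frequency, 0 to 7. Horizontal axis: class midpoints 44.5, 54.5, 64.5, 74.5, 84.5, 94.5.]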
Class Exercise:
Construct a frequency polygon from the gasoline purchase frequency table.
Now construct a cumulative frequency polygon (ogive) from the gasoline
purchase frequency table. What does the slope of the line segments tell you in
either case? What does a line segment with zero slope (“flat”) tell you in a
cumulative frequency polygon?
2.2 More Graphs and Displays:
A stem and leaf plot reports the exact data by the leftmost
number(s) being part of the stem, and the rightmost number(s)
being the leaves.
Ex: Given the previous statistics exam grades for 20 statistics students, let us
create a stem and leaf plot.
97, 92, 88, 75, 83, 67, 89, 55, 72, 78, 81, 91, 57, 63, 67, 74, 87, 84, 98, 46
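One possible way to build this plot in Python is sketched below. It assumes two-digit scores, so the tens digit is the stem and the ones digit is the leaf; the variable names are illustrative.

```python
from collections import defaultdict

scores = [97, 92, 88, 75, 83, 67, 89, 55, 72, 78,
          81, 91, 57, 63, 67, 74, 87, 84, 98, 46]

# Group the ones digits (leaves) under their tens digit (stem).
stem_leaf = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)
    stem_leaf[stem].append(leaf)

# Print one row per stem, leaves in ascending order.
for stem in sorted(stem_leaf):
    leaves = " ".join(str(leaf) for leaf in stem_leaf[stem])
    print(f"{stem} | {leaves}")
```

Every original score can be read back from the plot (for example, stem 4 with leaf 6 is the score 46), which is what "maintains the exact data" means.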
2.3 Measures of Central Tendency:
Sample mean: $\bar{x} = \frac{\sum x}{n}$
Population mean: $\mu = \frac{\sum x}{N}$
Exercise: Find the mean of the following data set:
Exercise: Find the median value for the set of quiz scores.
Mode – the data value (or values) which appears the largest
number of times in the set. If no data value is repeated, we say that there is
no mode.
Exercise: Find the mode(s) of the quiz score data set.
Outlier – a data entry far removed from the other entries in the data set.
Properties of Mean, Median, and Mode:
One drawback of the mean is that it is heavily influenced by a few very high
or very low data values (outliers or skewness). In these cases it is more
common to use the median.
The mode has the advantage that it can be used to measure data sets even if
they contain only qualitative data. A disadvantage is that a data set may
not have a mode.
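As an illustration, the three measures can be computed for the 20 exam scores used earlier with Python's standard `statistics` module (`multimode` requires Python 3.8 or later). This is a sketch on that data set, not a solution to the quiz-score exercises.

```python
import statistics

scores = [97, 92, 88, 75, 83, 67, 89, 55, 72, 78,
          81, 91, 57, 63, 67, 74, 87, 84, 98, 46]

mean = statistics.mean(scores)        # sum of the values divided by n
median = statistics.median(scores)    # middle value of the ordered data
                                      # (average of the two middle values when n is even)
modes = statistics.multimode(scores)  # every value with the highest count

print("mean:", mean)
print("median:", median)
print("mode(s):", modes)
```

Note that `multimode` returns every value when no value repeats, whereas these notes say such a data set has no mode, so check the counts before reporting a mode.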
Weighted Means:
Ex: Grade point average. We assign the letter grades the number values A=4,
B=3, C=2, D=1, F=0, and then each grade value is counted into the GPA
according to the number of credits earned (course’s weight) with that course
grade.
Calculate the GPA of a student who has earned 12 credits of A’s, 21 credits
of B’s, 5 credits of C’s and 3 credits of D’s.
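A minimal sketch of this weighted mean, assuming the usual form (sum of value times weight, divided by the sum of the weights). Grade values are the data values and credits are the weights; the dictionary names are illustrative.

```python
# Grade values (the data values) and credits earned at each grade (the weights).
grade_points = {"A": 4, "B": 3, "C": 2, "D": 1, "F": 0}
credits = {"A": 12, "B": 21, "C": 5, "D": 3}

# Weighted mean = sum(value * weight) / sum(weights)
weighted_sum = sum(grade_points[grade] * c for grade, c in credits.items())
total_credits = sum(credits.values())
gpa = weighted_sum / total_credits

print(f"GPA = {weighted_sum}/{total_credits} = {gpa:.2f}")
```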
Calculate the final score for a student who has scored 95 on quizzes, has
exam scores of 83, 94, and 77, and a final exam score of 88.
70, 80, 80, 80, 80, 80, 80, 80, 80, 80, 90, 100, 100
Estimating a Mean from a Frequency Table:
Given the frequency distribution of a data set, we can make the best estimate of
the mean for the data set by using a weighted mean.
1. Calculate the class midpoint for each data class: $x_{mid} = \frac{lower + upper}{2}$. (Our data
values for calculating the weighted mean.)
2. Use the frequency of the data class as the weight for each data class.
$\bar{x} = \frac{\sum (x_{mid} \cdot f)}{\sum f}$
Exercise: Estimate the mean of the data set whose frequency distribution is
given by:
Class Frequency ( f )
90-99 4
80-89 6
70-79 4
60-69 3
50-59 2
40-49 1
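The two steps can be sketched in Python for this table as follows. The result is an estimate, because every value in a class is treated as if it sat at the class midpoint; the variable names are illustrative.

```python
# (lower class limit, upper class limit, frequency) for each data class
table = [(90, 99, 4), (80, 89, 6), (70, 79, 4),
         (60, 69, 3), (50, 59, 2), (40, 49, 1)]

# Step 1: class midpoints are the stand-in data values.
midpoints = [(lower + upper) / 2 for lower, upper, _ in table]

# Step 2: class frequencies are the weights.
frequencies = [f for _, _, f in table]

n = sum(frequencies)
estimated_mean = sum(x_mid * f for x_mid, f in zip(midpoints, frequencies)) / n

print("midpoints:", midpoints)
print("estimated mean:", estimated_mean)
```

For comparison, the mean of the 20 raw scores is 1544/20 = 77.2, so the grouped estimate is close but not exact.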
Shapes of Data Distributions:
The mean and median (and mode if unimodal) are equal in a symmetric
distribution.
[Figures: two example histograms. Vertical axis: frequency, 0 to 14. Horizontal axis: values 1 to 9.]
Right-Skewed – A few data values are much higher than the
majority of values in the set. (Tail extends to the right)
Generally the mean is greater (to the right) than the median (and mode) in a
right-skewed distribution.
[Figures: two example histograms. First: vertical axis 0 to 12, horizontal axis 1 to 9. Second: horizontal axis 1 to 6.]
2.4 (MEASURES OF DISPERSION) Variation:
In a data set with little variation, almost all data values would be close to one
another. The histogram of such a data set would be narrow and tall.
In a data set with a great deal of variation, the data values would be spread
widely. The histogram of this data set would be low and wide.
1. Range – the difference between the largest and smallest data values
in a data set.
range = highest value - lowest value
Standard deviation is calculated using two different formulae depending on whether the data
set being considered is a population data set or a sample data set.
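Population standard deviation: $\sigma = \sqrt{\frac{\sum (x - \mu)^2}{N}}$
Sample standard deviation: $s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}}$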
3. Variance – the square of the standard deviation. Population variance is
represented by $\sigma^2$ and sample variance by $s^2$.
Calculating Standard Deviation Using the Formula:
2. Subtract the mean from each data value in the set. These values are called the
deviations of the data values.
5. Divide the sum from Step 4 by the population size for population standard
deviation or the sample size minus 1 for sample standard deviation.
Exercise: Find the range and standard deviation of the data set of quiz
scores used in the previous example:
Quiz Scores: 1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8
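A sketch of the calculation in Python, following the usual order of operations (mean, deviations, squared deviations, their sum, division, square root). Only the deviation and division steps are spelled out in these notes; the other steps are the standard ones and are labeled as such in the comments. Both the sample and population versions are shown, since which applies depends on how the data set is regarded.

```python
import math

quiz_scores = [1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8]

# Range: highest value minus lowest value.
data_range = max(quiz_scores) - min(quiz_scores)

n = len(quiz_scores)
mean = sum(quiz_scores) / n                      # standard first step: find the mean

deviations = [x - mean for x in quiz_scores]     # subtract the mean from each value
squared = [d ** 2 for d in deviations]           # standard step: square each deviation
sum_of_squares = sum(squared)                    # standard step: add the squares

sample_sd = math.sqrt(sum_of_squares / (n - 1))  # divide by n - 1 for a sample
population_sd = math.sqrt(sum_of_squares / n)    # divide by N for a population

print("range:", data_range)
print(f"sample standard deviation: {sample_sd:.3f}")
print(f"population standard deviation: {population_sd:.3f}")
```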
Estimating Standard Deviation using a Frequency Table (“Grouped Data”):
Given the frequency distribution of a data set, we can make the best estimate of
the standard deviation for the data set by using the same technique as for mean.
1. Calculate the class midpoint for each data class. These will be our data values
for calculating the standard deviation.
2. Use the frequency of the data class as the weight for each data class midpoint.
(That is, multiply by the frequency rather than having to sum that many times.)
$s = \sqrt{\frac{\sum (x_{mid} - \bar{x})^2 f}{n - 1}}$ (sample) OR $\sigma = \sqrt{\frac{\sum (x_{mid} - \mu)^2 f}{N}}$ (population)
Exercise: Estimate the standard deviation of the data set whose frequency
distribution is given by:
Class Frequency ( f )
90-99 4
80-89 6
70-79 4
60-69 3
50-59 2
40-49 1
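The grouped-data technique can be sketched in Python for this table. As with the estimated mean, the class midpoints stand in for the actual data values, so the result is an estimate; the variable names are illustrative.

```python
import math

# (lower class limit, upper class limit, frequency) for each data class
table = [(90, 99, 4), (80, 89, 6), (70, 79, 4),
         (60, 69, 3), (50, 59, 2), (40, 49, 1)]

midpoints = [(lower + upper) / 2 for lower, upper, _ in table]
frequencies = [f for _, _, f in table]
n = sum(frequencies)

# Estimated mean: weighted mean of the midpoints.
mean = sum(x * f for x, f in zip(midpoints, frequencies)) / n

# Squared deviation of each midpoint, multiplied by the class frequency.
weighted_squares = sum((x - mean) ** 2 * f for x, f in zip(midpoints, frequencies))

sample_sd = math.sqrt(weighted_squares / (n - 1))  # treat the data as a sample
population_sd = math.sqrt(weighted_squares / n)    # treat the data as a population

print("estimated mean:", mean)
print(f"estimated sample standard deviation: {sample_sd:.2f}")
print(f"estimated population standard deviation: {population_sd:.2f}")
```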
Using the TI-83 for Mean, Median & Standard Deviation:
Press [STAT]. A menu will appear in which EDIT is selected; choose EDIT by pressing [ENTER].
Enter the data values into one of the lists L1, L2, etc.
Use the arrow keys or press enter to enter the next value in the list.
Step 2: Calculate
Press [STAT]. Use the arrow keys to highlight the CALC menu.
Select the first entry 1-Var Stats from the CALC menu by pressing [ENTER]
On your screen 1-Var Stats will appear. If the data values are in L1, press enter to calculate. If they
are in another list, say L3, then you will need to select L3 first by pressing [2nd][3] and then press
[ENTER] to calculate.
The estimated mean, median, and standard deviation for data in a frequency table can be calculated using
the TI-83 as follows:
Step 2: Calculate
Choose 1-Var Stats from the CALC menu as before and when it appears on the screen choose L1
comma L2:
1-Var Stats L1, L2
Press [ENTER] to calculate.
Theorems Involving Standard Deviation:
The standard deviation of a data set is an important quantity because it limits the
number of data values that can be very far (high or low) from average.
Chebychev’s Theorem
Applies to any data set.
The portion (%) of data values that must be within k (for k > 1) standard
deviations of the mean is at minimum: $1 - \frac{1}{k^2}$
Ex: Let's try k=2, 3, 4, 5. What happens as k increases? (See p83 ex8.)
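A quick sketch that evaluates the bound 1 - 1/k² for these values of k. The helper name `chebyshev_bound` is illustrative, not from the notes.

```python
def chebyshev_bound(k):
    """Minimum proportion of data within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("Chebychev's Theorem is only informative for k > 1")
    return 1 - 1 / k ** 2

bounds = {k: chebyshev_bound(k) for k in (2, 3, 4, 5)}

for k, bound in bounds.items():
    print(f"k = {k}: at least {bound:.1%} of the data values")
```

The bound increases toward 100% as k grows, but it never reaches 100% for any finite k.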
Note: Chebychev’s Theorem gives only cautious lower bounds for the proportion of data
values, whereas the Empirical Rule gives approximations. If a data distribution is known to
be bell-shaped, the Empirical Rule should be used.
2.5 Measures of Position:
Fractiles divide a data set into consecutive intervals so that each interval has (at
least approximately) the same number of data values. The most common fractiles
are:
Percentiles – divide a data set into 100 parts. For example, the
36th percentile is the value which separates the lowest 36% of data values
from the highest 64% of data values and is denoted by P36.
Quartiles – divide a data set into fourths. For example, the first
quartile Q1 divides the lowest quarter of a data set from the upper three
quarters.
Deciles – divide a data set into 10 parts. For example, the 7th
decile is the value which separates the lowest 7/10 of data values from the
highest 3/10 of data values and is denoted D7.
Note: There are 99 percentiles P1-P99, 3 quartiles Q1-Q3, and 9 deciles D1-D9.
Ex: Using the quiz scores from before put them in order, then find and
interpret Q1-Q3.
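One way to compute the quartiles is sketched below, using the quiz scores 1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8 from the standard deviation exercise. It assumes the "median of each half" convention: Q2 is the median, and Q1 and Q3 are the medians of the values below and above it. Other conventions exist (calculators and software may interpolate), so results can differ slightly from other tools.

```python
from statistics import median

quiz_scores = [1, 5, 7, 7, 6, 8, 10, 9, 5, 10, 8]

ordered = sorted(quiz_scores)
n = len(ordered)
half = n // 2

# Split the ordered data around the median; when n is odd the
# median value itself is left out of both halves.
lower_half = ordered[:half]
upper_half = ordered[half + 1:] if n % 2 else ordered[half:]

q1 = median(lower_half)
q2 = median(ordered)
q3 = median(upper_half)

print("ordered data:", ordered)
print("Q1 =", q1, " Q2 =", q2, " Q3 =", q3)
```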
A Box (and Whisker) Plot illustrates the range, Q1, Q2 (median) and Q3. Let's draw one and
discuss it.
(We can also do all of this in our calculators, anyone interested?) (p 125)
Ex: If your doctor tells you your 3 year old is in the 50th percentile for height
and the 35th percentile for weight, what does that mean?
The standard score (or z-score) of a data value is the number of standard
deviations that the value lies above or below the mean for a bell-shaped
distribution.
Think about it. The larger the z-score, the ____________________the mean.
The ______________________the mean, the ___________ the percentage of
data between the mean and that z-score.
6 feet tall
5’1/2’’
6’3’’
Now find the percentile for the last two men above.
Note: The z-score of a value is positive if the value is above the mean and
negative if it is below the mean. The mean itself always has a z-score of _____.
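A minimal sketch of the z-score calculation, z = (value - mean) / standard deviation. The mean of 69 inches and standard deviation of 3 inches below are hypothetical placeholders chosen only to make the arithmetic easy; they are not the figures used in class.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations that value lies above (+) or below (-) the mean."""
    return (value - mean) / std_dev

# HYPOTHETICAL parameters for illustration only (not from the notes).
mean_height = 69.0  # inches
std_dev = 3.0       # inches

six_feet = 6 * 12            # 72 inches
six_three = 6 * 12 + 3       # 75 inches
below_average = 66           # a hypothetical height below the mean

for height in (six_feet, six_three, below_average):
    print(f"{height} in: z = {z_score(height, mean_height, std_dev):+.2f}")
```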
Think about the Empirical Rule and Chebychev’s Theorem. Why does this make
sense?
Ex: p112 #32, 36 (look at “uses & abuses” charts p 115)