
Stem and leaf diagrams

Eg the times, in seconds, of 18 competitors to complete a ski course are:

51, 63, 84, 64, 55, 63, 70, 81, 73, 51, 82, 58, 62, 65, 69, 81, 73, 79

Construct an ordered stem and leaf diagram to show this data

First construct an unordered diagram

Stem │ Leaf
5 │ 1 5 1 8
6 │ 3 4 3 2 5 9
7 │ 0 3 3 9
8 │ 4 1 2 1

Then re-sort the leaves into size order and provide a key

Stem │ Leaf
5 │ 1 1 5 8
6 │ 2 3 3 4 5 9
7 │ 0 3 3 9
8 │ 1 1 2 4

Key: 5│3 means 53 seconds


Box plots
Box plots are an effective way to present a set of data so that you can
see its average and spread, get a sense of how the data is distributed
and possibly compare it with another set of data.
Draw a ‘ruler scale’ extending as far as the minimum & maximum values
The box corresponds with the lower & upper quartiles as its edges and the median as the line through it
The ‘whiskers’ correspond to the minimum & maximum values

Eg Salary data from previous page
Minimum 10
Lower Quartile 24
Median 31
Upper Quartile 40
Maximum 100
In S1 you use calculations that measure:
Average a single value that is ‘representative’ of the data

Spread a single value that indicates how wide-ranging the data are

Relative dispersion how the average compares to the spread

Skewness whether data tends towards the lowest or highest value in the distribution

Outliers whether data are so far from the average that you disregard them

You must be able to apply these calculations to both discrete and continuous data, presented as a list, in a frequency distribution or in groups

You are also examined on your ability to interpret these calculations – what do they tell you about the distribution of the data?

In my experience pupils consistently under-estimate S1, believing it to be easy and that they do not need to be able to interpret data, only to ‘do the Maths bit’ (ie calculations).
They are wrong! And it pains me to admit that exam marks in previous years prove this to be true… the average result over the past 3 years is 63%, a low C grade
Notation
In S1, the symbol used for mean is μ
the symbol Σ (capital ‘sigma’) means ‘the sum of’

the data values are referred to as x or xi

the number of times a data value occurs, or frequency, is referred to as f


the total number of data values is referred to as n or Σf

Mean for a list of data


Eg the number of runs scored in each innings by Kevin Pietersen during the
successful 2005 Ashes series were 64, 57, 20, 71, 0, 21, 23, 45, 158, 14
Mean = (64 + 57 + 20 + 71 + 0 + 21 + 23 + 45 + 158 + 14) / 10 = 473 / 10 = 47.3 runs

How is this written using the notation above?

μ = Σx / n for data in a list
μ = Σfx / Σf for data in a frequency distribution
Dealing with large amounts of data
If data has a small set of possible values it can be useful to display it in
a frequency table, rather than as a list.

Eg the number of goals conceded by Everton each game in a season:

4,5,7,2,2,3,3,4,2,6,5,4,5,5,4,6,3,1,2,3,4,5,6,5,4,3,2,3,3,4,5,3,6,5,5,3,4,1

Counting these, as with a tally, we get the following:

Goals Frequency
1 2
2 5
3 9
4 8
5 9
6 4
7 1

Displaying the data in this way has several benefits:
•Data calculations can be done quickly
•You can see the overall ‘distribution’ of the data easily
•Less writing!
Mean for large amounts of data
Eg the number of goals conceded by Everton each game in a season is given in
the frequency table below. Calculate the mean number of goals conceded per game.

Goals (x) Frequency (f) fx
1 2 2
2 5 10
3 9 27
4 8 32
5 9 45
6 4 24
7 1 7
Σf = 38 Σfx = 147

Mean = Σfx / Σf = 147 / 38 = 3.868... = 3.9 goals (1dp)

Why calculate the fx?

1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,6,6,7
2 × 1 = 2, 5 × 2 = 10, 9 × 3 = 27, 8 × 4 = 32, 9 × 5 = 45, 4 × 6 = 24, 1 × 7 = 7

The fx values give the total for each value; we then add them to find
the total value of all data - this is needed to calculate the mean
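
The tally and the Σfx / Σf calculation can be sketched in Python as below. This is only an illustration using the Everton list from earlier; `Counter` from the standard library does the tallying, and the variable names are assumptions made for the example.

```python
from collections import Counter

# Goals conceded per game, as listed earlier
conceded = [4, 5, 7, 2, 2, 3, 3, 4, 2, 6, 5, 4, 5, 5, 4, 6, 3, 1, 2,
            3, 4, 5, 6, 5, 4, 3, 2, 3, 3, 4, 5, 3, 6, 5, 5, 3, 4, 1]

table = Counter(conceded)                        # value x -> frequency f
total_f = sum(table.values())                    # sum of f
total_fx = sum(x * f for x, f in table.items())  # sum of fx
mean_goals = total_fx / total_f

print(sorted(table.items()))
print(total_f, total_fx, round(mean_goals, 1))   # 38 147 3.9
```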
Data Calculations- grouped data
If data has a large set of possible values it can be useful to sort it into
groups to help understand how the data is distributed.

Eg the times taken by 36 pupils to run a 100m race

Time t (seconds) Frequency
12 ≤ t < 13 3
13 ≤ t < 14 8
14 ≤ t < 16 16
16 ≤ t < 18 7
18 ≤ t < 24 2

Displaying the data in this way has several benefits:
•You can see the overall ‘distribution’ of the data easily
•Less writing!
•Data calculations can be done quickly, although you cannot know exact values of mean, median and mode

16 ≤ t < 18 is shorthand for saying ‘times between 16 and 18 seconds, including 16 exactly
but less than 18’. This avoids confusion about which class values go into.
Estimating mean for grouped data
Eg The times taken by 36 pupils to run a 100m race are given
in the table below. Work out an estimate for the mean time.

Use class midpoints as x

Time t (seconds) Frequency f Midpoint x fx
12 ≤ t < 13 3 12.5 37.5
13 ≤ t < 14 8 13.5 108
14 ≤ t < 16 16 15 240
16 ≤ t < 18 7 17 119
18 ≤ t < 24 2 21 42
Total f = 36 Total fx = 546.5

Estimated mean = Σfx / Σf = 546.5 / 36 = 15.180... = 15.2 seconds (1dp)
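
A minimal Python sketch of the midpoint method, assuming each class is stored as (lower boundary, upper boundary, frequency). The layout and the name `estimated_mean` are illustrative choices.

```python
# Each class: (lower boundary, upper boundary, frequency)
race_times = [(12, 13, 3), (13, 14, 8), (14, 16, 16), (16, 18, 7), (18, 24, 2)]

def estimated_mean(classes):
    """Estimate the mean by treating every value as its class midpoint."""
    total_f = sum(f for _, _, f in classes)
    total_fx = sum((low + high) / 2 * f for low, high, f in classes)
    return total_fx / total_f

print(round(estimated_mean(race_times), 1))  # 15.2
```

The result is only an estimate because the exact times inside each class are unknown.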
Poorly defined groups
Eg The times taken by 26 pupils to run an 800m race are given in the table below.
Estimate the mean time taken to run the race.

You must be careful identifying class boundaries, midpoints
and class widths when dealing with groups that don’t ‘meet’

Time t (seconds) Frequency f Class boundaries Midpoint x fx
120-130 2 119.5-130.5 125 250
131-150 4 130.5-150.5 140.5 562
151-180 9 150.5-180.5 165.5 1489.5
181-205 6 180.5-205.5 193 1158
206-225 5 205.5-225.5 215.5 1077.5
Total f = 26 Total fx = 4537

Estimated mean = 4537 / 26 = 174.5 seconds
Combined mean
Eg The mean percentage achieved in S1 was 58% for the 12 pupils that sat it in 2008
The mean percentage achieved in S1 was 72% for the 7 pupils that sat it in 2009
What is the overall mean for 2008-2009?

2008: Σx = 58 × 12 = 696
2009: Σx = 72 × 7 = 504
Both: Σx = 696 + 504 = 1200 n = 12 + 7 = 19
Overall mean = 1200 / 19 = 63.16% (2dp)
Eg Aston Villa are considering whether to expand their stadium.
For the first 10 home games of the season, the total attendance is 382460.
The mean attendance for all 19 home games is 40739
a) Find the mean attendance for the last 9 home games of the season.
b) What do these statistics suggest about the proposal to expand the stadium?

All 19 games: Σx = 19 × 40739 = 774041
Last 9 games: Σx = 774041 − 382460 = 391581, so mean = 391581 / 9 = 43509
First 10 games: mean = 382460 / 10 = 38246
Attendances have improved, supporting the argument for expanding the stadium
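
The attendance working can be sketched in Python as follows. The helper `combined_mean` is a hypothetical name for this illustration; it turns each group's mean back into a total before dividing by the overall count.

```python
def combined_mean(groups):
    """groups is a list of (mean, n) pairs; rebuild each total, then divide."""
    total = sum(m * n for m, n in groups)
    count = sum(n for _, n in groups)
    return total / count

# Aston Villa figures from the example above
total_all = 19 * 40739             # all 19 home games
total_first10 = 382460             # first 10 home games
mean_first10 = total_first10 / 10
mean_last9 = (total_all - total_first10) / 9

print(mean_first10, mean_last9)    # 38246.0 43509.0
print(combined_mean([(mean_first10, 10), (mean_last9, 9)]))  # 40739.0
```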
Puzzles involving averages
1 2 3 4 5
6 7 8 9 10

[Slide prompts overlap in the extracted text: each asks you to pick three or four of the numbers above to meet a target mean and/or median.]
Median & quartiles for a list of data
You must be able to identify the median, upper and lower quartiles of data

Lower quartile = ¼ of way: the (n/4)th piece of data
Median = ½ way: the (n/2)th piece of data
Upper quartile = ¾ of way: the (3n/4)th piece of data

If these calculations give an integer, find the average of this and the next piece of data
If not, just use the next piece of data

Eg the number of runs scored in each innings by Kevin Pietersen during the
successful 2005 Ashes series were 64, 57, 20, 71, 0, 21, 23, 45, 158, 14

10 data values, so n = 10
In order: 0, 14, 20, 21, 23, 45, 57, 64, 71, 158

Lower quartile: n/4 = 10/4 = 2.5, so use the 3rd = 20 runs
Median: n/2 = 10/2 = 5, so use (5th + 6th) / 2 = (23 + 45) / 2 = 34 runs
Upper quartile: 3n/4 = 7.5, so use the 8th = 64 runs
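
The position rule above can be sketched in Python as below. This follows the rule exactly as stated here (average with the next value for a whole-number position, otherwise round up). Library quartile functions often use other conventions, so they may give different answers. The function names are illustrative.

```python
import math

def value_at(ordered, position):
    """Whole-number position: average that value and the next one.
    Otherwise: use the next piece of data (round the position up)."""
    if position == int(position):
        k = int(position)
        return (ordered[k - 1] + ordered[k]) / 2  # positions are 1-indexed
    return ordered[math.ceil(position) - 1]

def quartiles(data):
    """Return (lower quartile, median, upper quartile) using the rule above."""
    ordered = sorted(data)
    n = len(ordered)
    return (value_at(ordered, n / 4),
            value_at(ordered, n / 2),
            value_at(ordered, 3 * n / 4))

runs = [64, 57, 20, 71, 0, 21, 23, 45, 158, 14]
print(quartiles(runs))  # (20, 34.0, 64)
```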
Median & quartiles for large amounts of data
Eg the number of goals conceded by Everton each game in a season is given in the
frequency table below. Calculate the median number of goals conceded per game.

Make running totals of frequencies

Goals Frequency Running total
1 2 2
2 5 7
3 9 16
4 8 24
5 9 33
6 4 37
7 1 38
n = 38

Lower quartile: n/4 = 9.5, so use the 10th = 3 goals
Median: n/2 = 19, so use (19th + 20th) / 2 = (4 + 4) / 2 = 4 goals
Upper quartile: 3n/4 = 28.5, so use the 29th = 5 goals

Why find the last position that each value takes?

1,1,2,2,2,2,2,3,3,3,3,3,3,3,3,3,4,4,4,4,4,4,4,4,5,5,5,5,5,5,5,5,5,6,6,6,6,7
The last 1 is the 2nd value, the last 2 the 7th, the last 3 the 16th, the last 4 the 24th and the last 5 the 33rd

So the 17th to 24th values, including the 19th and 20th used for the median, must be 4
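
Reading positions off the running totals can be sketched in Python as follows, assuming the Everton table above. `bisect_left` finds the first running total that reaches a given position; the helper names are illustrative.

```python
from bisect import bisect_left
from itertools import accumulate
import math

goals = [1, 2, 3, 4, 5, 6, 7]
frequency = [2, 5, 9, 8, 9, 4, 1]
running_total = list(accumulate(frequency))  # last position taken by each value
n = running_total[-1]

def value_at(k):
    """Value of the k-th piece of data (1-indexed), read off the running totals."""
    return goals[bisect_left(running_total, k)]

def value_at_position(position):
    """Whole-number position: average that value and the next; otherwise round up."""
    if position == int(position):
        k = int(position)
        return (value_at(k) + value_at(k + 1)) / 2
    return value_at(math.ceil(position))

q1 = value_at_position(n / 4)
median = value_at_position(n / 2)
q3 = value_at_position(3 * n / 4)
print(running_total)   # [2, 7, 16, 24, 33, 37, 38]
print(q1, median, q3)  # 3 4.0 5
```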


Estimating median & quartiles for grouped data
Eg The times taken by 36 pupils to run a 100m race are given in the
table below. Use interpolation to estimate the median & quartiles.

Time t (seconds) Frequency Running total
12 ≤ t < 13 3 3
13 ≤ t < 14 8 11
14 ≤ t < 16 16 27
16 ≤ t < 18 7 34
18 ≤ t < 24 2 36

Lower quartile = (n/4)th Median = (n/2)th Upper quartile = (3n/4)th

Lower quartile: n/4 = 9th, which is 6 into the 13 ≤ t < 14 group
13 + (6/8) × 1 = 13.75
Median: n/2 = 18th, which is 7 into the 14 ≤ t < 16 group
14 + (7/16) × 2 = 14.875
Upper quartile: 3n/4 = 27th, which is 16 into the 14 ≤ t < 16 group
14 + (16/16) × 2 = 16
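
The interpolation step can be sketched in Python as below, assuming each class is stored as (lower boundary, upper boundary, frequency). The function name `interpolate` and this data layout are illustrative, not taken from the course.

```python
def interpolate(classes, position):
    """Estimate the value at a position: lower boundary +
    (how far into the class / class frequency) x class width."""
    running = 0
    for lower, upper, f in classes:
        if position <= running + f:
            return lower + (position - running) / f * (upper - lower)
        running += f
    raise ValueError("position is beyond the last class")

race_times = [(12, 13, 3), (13, 14, 8), (14, 16, 16), (16, 18, 7), (18, 24, 2)]
n = sum(f for _, _, f in race_times)

print(interpolate(race_times, n / 4))      # 13.75
print(interpolate(race_times, n / 2))      # 14.875
print(interpolate(race_times, 3 * n / 4))  # 16.0
```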
Poorly defined groups
Eg The times taken by 26 pupils to run an 800m race are given in the table below.
Use interpolation to find the median and quartiles

You must be careful identifying class boundaries, midpoints
and class widths when dealing with groups that don’t ‘meet’

Time t (seconds) Frequency Class boundaries Class width
120-130 2 119.5-130.5 11
131-150 4 130.5-150.5 20
151-180 9 150.5-180.5 30
181-205 6 180.5-205.5 25
206-225 5 205.5-225.5 20

Lower quartile: n/4 = 6.5th
150.5 + ((6.5 − 6) / 9) × 30 = 152.17
Median: n/2 = 13th
150.5 + ((13 − 6) / 9) × 30 = 173.83
Upper quartile: 3n/4 = 19.5th
180.5 + ((19.5 − 15) / 6) × 25 = 199.25
Measuring spread
Eg the lengths of 5 worms are 4cm, 5cm, 8cm, 13cm and 15cm
Suppose you found the average difference between each piece of data and the mean

Mean = (4 + 5 + 8 + 13 + 15) / 5 = 9
Differences from the mean: 5, 4, 1, 4, 6
Average difference = (5 + 4 + 1 + 4 + 6) / 5 = 4

The larger the average difference, the more spread out the data

What would happen if you used a calculator to do this?

Sum of differences = (4 − 9) + (5 − 9) + (8 − 9) + (13 − 9) + (15 − 9) = −5 − 4 − 1 + 4 + 6 = 0

If you weren’t careful the data values less than the mean would
contribute a negative difference and mess up the calculation…
Standard Deviation
In the GCSE you used range and inter-quartile range to measure the spread of data
In S1 you also use a measure called standard deviation, denoted by the symbol σ

Closely connected to the idea shown on the last page:

σ = √( Σ(x − μ)² / n ) where μ is the mean, calculated by μ = Σx / n

This considers the square of the difference between each piece of
data and the mean, and so avoids problems with negative differences

Eg the lengths of 5 worms are 4cm, 5cm, 8cm, 13cm and 15cm

σ = √( [(4 − 9)² + (5 − 9)² + (8 − 9)² + (13 − 9)² + (15 − 9)²] / 5 ) = √(94 / 5) = 4.3 (1dp)

NB: this calculation does not give the same answer as the method shown
previously, so do not be tempted to calculate standard deviation without squaring
Standard Deviation for a list of data
The calculation involves repeatedly subtracting the mean.
Using algebra beyond the scope of S1, the rule can be
manipulated into a form that is faster to calculate:

σ = √( Σx² / n − μ² )

Eg the lengths of 5 worms are 4cm, 5cm, 8cm, 13cm and 15cm

Mean μ = 9 from previously
σ = √( (4² + 5² + 8² + 13² + 15²) / 5 − 9² ) = √(99.8 − 81) = 4.3 (1dp)

Often, you will be told what Σx² and Σx are, and only need to calculate the
mean before substituting those values into the standard deviation formula:

Eg given that Σx² = 641.5 and Σx = 53.8 for 8 pieces of data, calculate σ

σ = √( 641.5 / 8 − (53.8 / 8)² ) = 5.9 (1dp)

NB: The rule for σ is not given to you on the formula sheet – you must memorise it
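
Both forms of the rule can be compared with a short Python sketch. The function names are illustrative; `statistics.pstdev` from the standard library computes the same divide-by-n standard deviation and is used here only as a cross-check.

```python
import math
import statistics

def sd_definition(values):
    """sigma = sqrt( sum of (x - mu)^2 / n )"""
    n = len(values)
    mu = sum(values) / n
    return math.sqrt(sum((x - mu) ** 2 for x in values) / n)

def sd_from_sums(sum_x2, sum_x, n):
    """sigma = sqrt( (sum of x^2) / n - mu^2 ), with mu = (sum of x) / n"""
    return math.sqrt(sum_x2 / n - (sum_x / n) ** 2)

worms = [4, 5, 8, 13, 15]
sum_x = sum(worms)
sum_x2 = sum(x * x for x in worms)

print(round(sd_definition(worms), 1))                     # 4.3
print(round(sd_from_sums(sum_x2, sum_x, len(worms)), 1))  # 4.3
print(round(statistics.pstdev(worms), 1))                 # 4.3
print(round(sd_from_sums(641.5, 53.8, 8), 1))             # 5.9
```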
Standard deviation for large amounts of data
Eg the number of goals conceded by Everton each game in a season
is given in the frequency table below. Calculate the standard deviation
of the number of goals conceded per game.

Goals (x) Frequency (f) fx²
1 2 2
2 5 20
3 9 81
4 8 128
5 9 225
6 4 144
7 1 49
Σf = 38 Σfx² = 649

μ = Σfx / Σf = 147 / 38 from previously

σ = √( Σfx² / n − μ² ) = √( 649 / 38 − (147 / 38)² ) = 1.45 goals (2dp)
Standard deviation for grouped data
Eg The times taken by 36 pupils to run a 100m race are given in the table below.
Estimate the standard deviation, given that Σfx = 546.5 and Σfx² = 8431.75

You will often be given Σfx and Σfx²

Time t (seconds) Frequency
12 ≤ t < 13 3
13 ≤ t < 14 8
14 ≤ t < 16 16
16 ≤ t < 18 7
18 ≤ t < 24 2

σ = √( Σfx² / n − μ² ) = √( 8431.75 / 36 − (546.5 / 36)² ) = 1.94 seconds (2dp)
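
A minimal Python sketch of the same rule for a frequency table or for grouped data (using class midpoints as x). The function name `sd_from_table` is an assumption made for this example.

```python
import math

def sd_from_table(xs, fs):
    """sigma = sqrt( sum(f x^2) / sum(f) - (sum(f x) / sum(f))^2 )"""
    n = sum(fs)
    sum_fx = sum(x * f for x, f in zip(xs, fs))
    sum_fx2 = sum(x * x * f for x, f in zip(xs, fs))
    return math.sqrt(sum_fx2 / n - (sum_fx / n) ** 2)

# Everton goals conceded (exact values)
print(round(sd_from_table([1, 2, 3, 4, 5, 6, 7], [2, 5, 9, 8, 9, 4, 1]), 2))  # 1.45

# 100m race times (class midpoints, so the result is an estimate)
print(round(sd_from_table([12.5, 13.5, 15, 17, 21], [3, 8, 16, 7, 2]), 2))    # 1.94
```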
Standard Deviation Rule v2

σ = √( Σ(x − μ)² / n ), so

σ² = Σ(x − μ)² / n
   = Σ(x² − 2μx + μ²) / n
   = Σx² / n − 2μ(Σx / n) + μ²(Σ1 / n)
   = Σx² / n − 2μ² + μ² (since Σx / n = μ and Σ1 = n)
   = Σx² / n − μ²

Therefore σ = √( Σx² / n − μ² )
Measuring skew

Symmetrical
Mode = Median = Mean
Q2-Q1 = Q3-Q2

Positively skewed
Mode < Median < Mean
Q2-Q1 < Q3-Q2

Negatively skewed
Mode > Median > Mean
Q2-Q1>Q3-Q2
Other measures of skew
If mean > mode indicates positive skew, then:

(mean − mode) / standard deviation
> 0 if data is positively skewed
< 0 if data is negatively skewed

Dividing by σ ‘scales’ the value, but has no effect on its sign as σ is always positive

Similarly, if mean > median indicates positive skew, then:

3(mean − median) / standard deviation
> 0 if data is positively skewed
< 0 if data is negatively skewed

If Q3 − Q2 > Q2 − Q1 indicates positive skew, then so does (Q3 − Q2) − (Q2 − Q1) > 0, ie Q3 − 2Q2 + Q1 > 0
Scaling this by dividing by the IQR Q3 − Q1:

(Q3 − 2Q2 + Q1) / (Q3 − Q1)
> 0 if data is positively skewed
< 0 if data is negatively skewed

You will be told which of these measures to use in the exam – all you have to do is substitute
the values and remember that if the outcome is positive, this indicates positive skew!
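
The three skew measures can be sketched in Python as below. The function names and the `describe` helper are hypothetical, and the input values are illustrative only.

```python
def mode_skew(mean, mode, sd):
    """(mean - mode) / standard deviation"""
    return (mean - mode) / sd

def median_skew(mean, median, sd):
    """3(mean - median) / standard deviation"""
    return 3 * (mean - median) / sd

def quartile_skew(q1, q2, q3):
    """(Q3 - 2Q2 + Q1) / (Q3 - Q1)"""
    return (q3 - 2 * q2 + q1) / (q3 - q1)

def describe(value):
    """Only the sign matters: positive means positive skew."""
    if value > 0:
        return "positive skew"
    if value < 0:
        return "negative skew"
    return "no skew"

print(quartile_skew(55, 64, 79), describe(quartile_skew(55, 64, 79)))      # 0.25 positive skew
print(quartile_skew(62, 74.5, 87), describe(quartile_skew(62, 74.5, 87)))  # 0.0 no skew
print(describe(median_skew(50, 60, 20)))                                   # negative skew
```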
Statistical Calculations with grouped data
1. Mr Walker is analysing the January exam results in Maths:

4 │ 0 5
5 │ 1 4 9
6 │ 2 5 7 8
7 │ 2 3 6 6
8 │ 0 4 7 7 7
9 │ 2 3 8 9
Key: 5 │ 4 means 54%

n = 22 Σx = 1615 Σx² = 124551

Mode = 87
Mean μ = 1615 / 22 = 73.4
Lower quartile Q1 = 6th = 62
Median Q2 = 11.5th = 74.5
Upper quartile Q3 = 17th = 87
IQR = 25
Standard deviation σ = √( 124551 / 22 − (1615 / 22)² ) = 16.5
Skew using (Q3 − 2Q2 + Q1) / (Q3 − Q1) = (87 − 2 × 74.5 + 62) / (87 − 62) = 0
No skew
2. Mr Walker is also analysing the January exam results in Science:

4 │ 3 8 9
5 │ 1 1 5 8
6 │ 2 3 3 3 5 9
7 │ 0 3 3 9
8 │ 1 1 2 4
9 │ 5

You may use that Σx = 1458 and Σx² = 100608
n = 22

Mode = 63
Mean μ = 1458 / 22 = 66.3
Lower quartile Q1 = 6th = 55
Median Q2 = (11th + 12th) / 2 = 64
Upper quartile Q3 = 17th = 79
IQR = 24
Standard deviation σ = √( 100608 / 22 − (1458 / 22)² ) = 13.5
Skew using (Q3 − 2Q2 + Q1) / (Q3 − Q1) = (79 − 2 × 64 + 55) / (79 − 55) = 1/4
Positive skew
3. Farmer Smith wants to know which of his hens is the most prolific egg-layer.
He records the number of eggs laid each day over a 3-week period by his best hen:

Number of eggs (x) Frequency (f) fx Running total fx²
1 3 3 3 3
2 6 12 9 24
3 5 15 14 45
4 1 4 15 16
5 6 30 21 150
Σf = 21 Σfx = 64 Σfx² = 238

Mode = 2 and 5
Mean μ = 64 / 21 = 3.0
Lower quartile Q1 = 6th = 2
Median Q2 = 11th = 3
Upper quartile Q3 = 16th = 5
IQR = 3
Standard deviation σ = √( 238 / 21 − (64 / 21)² ) = 1.4
Skew using (mean − median) / standard deviation = (3 − 3) / 1.4 = 0
No skew
4. Farmer Jones also wants to know which of his hens is the most prolific egg-layer.
He records the number of eggs laid each day over a 4-week period by his best hen:

Number of eggs (x) Frequency (f) Running total
1 2 2
2 5 7
3 10 17
4 8 25
5 3 28

You may use that Σfx = 89 and Σfx² = 315

Mode = 3
Mean μ = 89 / 28 = 3.2
Lower quartile Q1 = (7th + 8th) / 2 = 2.5
Median Q2 = (14th + 15th) / 2 = 3
Upper quartile Q3 = (21st + 22nd) / 2 = 4
IQR = 1.5
Standard deviation σ = √( 315 / 28 − (89 / 28)² ) = 1.1
Skew by comparing mean, median & mode: mean > median & mode, so positive skew
5. The owners of KFC want to analyse the weights of their target customers.
They conduct a survey:

Weight (w kg) Frequency Running total
0 ≤ w < 30 13 13
30 ≤ w < 40 24 37
40 ≤ w < 70 17 54
70 ≤ w < 80 12 66
80 ≤ w < 100 5 71
Σf = 71

You may use that Σfx = 3320 and Σfx² = 191750

Mean μ = 3320 / 71 = 46.8
Lower quartile Q1 = 30 + (4.75 / 24) × 10 = 32.0
Median Q2 = 30 + (22.5 / 24) × 10 = 39.4
Upper quartile Q3 = 40 + (16.25 / 17) × 30 = 68.7
IQR = 36.7
Standard deviation σ = √( 191750 / 71 − (3320 / 71)² ) = 22.7
Skew using (Q3 − 2Q2 + Q1) / (Q3 − Q1) = (68.7 − 2 × 39.4 + 32.0) / 36.7 = 0.597
Positive skew
6. The owners of McDonald’s want to analyse the weights of their target customers.
They conduct a survey:

Weight (kg) Frequency Midpoint fx Running total fx²
0-20 24 10.25 246 24 2521.5
21-40 31 30.5 945.5 55 28837.75
41-60 47 50.5 2373.5 102 119861.75
61-80 75 70.5 5287.5 177 372768.75
81-100 23 90.5 2081.5 200 188375.75
Σf = 200 Σfx = 10934 Σfx² = 712365.5

Mean μ = 10934 / 200 = 54.7
Lower quartile Q1 = 20.5 + (26 / 31) × 20 = 37.3
Median Q2 = 40.5 + (45 / 47) × 20 = 59.6
Upper quartile Q3 = 60.5 + (48 / 75) × 20 = 73.3
IQR = 36
Standard deviation σ = √( 712365.5 / 200 − (10934 / 200)² ) = 23.9
Skew using 3(mean − median) / standard deviation = −0.624
Negative skew
Outliers
Eg the number of runs scored in each innings by Kevin Pietersen during the
successful 2005 Ashes series were 64, 57, 20, 71, 0, 21, 23, 45, 158, 14

From previously, Q1 = 20 runs Q2 = 34 runs Q3 = 64 runs

Do any of his scores stand out as unusual or inconsistent with his other performances?
Clearly, the 158 was by far his best score – it could be considered as an outlier

Outliers are usually identified by the rule:

If x > Q3 + 1.5(Q3 − Q1) or x < Q1 − 1.5(Q3 − Q1) then x is outlying

What scores would be classed as outliers using this measure?

Q3 + 1.5(Q3 − Q1) = 64 + 1.5 × 44 = 130, so the score of 158 is an outlier

Q1 − 1.5(Q3 − Q1) = 20 − 1.5 × 44 = −46, so it is not possible here to have a ‘small’ outlier

The boundary of being 1½ times the IQR away is arbitrary, and
you may be given a different threshold for classifying outliers, eg x > μ + 2σ or x < μ − 2σ
Now try Ex4G, p72, Q1,8
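
The usual 1.5 × IQR rule can be sketched in Python as below. The function names and the `k` parameter (which lets a different multiple be used if a question specifies one) are illustrative assumptions.

```python
def outlier_boundaries(q1, q3, k=1.5):
    """Return (lower boundary, upper boundary) = (Q1 - k*IQR, Q3 + k*IQR)."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def find_outliers(data, q1, q3, k=1.5):
    """List the values lying outside the two boundaries."""
    low, high = outlier_boundaries(q1, q3, k)
    return [x for x in data if x < low or x > high]

runs = [64, 57, 20, 71, 0, 21, 23, 45, 158, 14]
print(outlier_boundaries(20, 64))   # (-46.0, 130.0)
print(find_outliers(runs, 20, 64))  # [158]
```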
Interpreting and comparing data
Pupils lose marks in S1 because they are unable to interpret and compare data.

The number of marks available tells you how many different measures to analyse

The main features of a distribution, in order of importance, are:

Average – compare means and/or medians

Spread – compare IQR and/or standard deviation

Skew – use given measures


Outliers – usually use > Q3 + 1.5IQR and < Q1 – 1.5IQR

If there are 2 marks, write one sentence each about average and spread

If there are 3 marks, write one sentence each about average, spread and skew

If there are 4 marks, write one sentence each about all four main features
Relating measures to the context
It is also critical that you relate these measures to the context of the data…

A survey is done of test marks for 2 classes. Here is a summary of statistics:

Class A Class B
Mean 71% 72%
Standard Deviation 12% 6%

Compare this data (2 marks)

As there are 2 marks, write one sentence each about average and spread
Relate measures to the context of the data…

The classes have a very similar mean, suggesting that on average, results are similar
Class A has a higher standard deviation, showing that their results are more spread out
Eg The box plots show the distribution of the number of worms found in the
soil of two farms, one which uses pesticides and another which is organic.
It is claimed that the use of pesticides is affecting the worm population.
Assess this claim. (3 marks)

[Box plots: Organic farm; Farm using pesticides]

The median for the farm which uses pesticides is lower – on average there are fewer
worms in its soil.
The IQR and range for the farm which uses pesticides are also lower – there is less
variation in the number of worms in the soil.
The data supports the claim as the average and spread have decreased.

Now try:
Ex4F, p71-72, Q1,4
Ex4G, p73-74, Q3+4
7a) Compare the Maths exam marks with the Science exam marks

b) Mr Brown wants to give a bonus to a department on the basis of the exam results.
Use your answer to (a) to advise him.

8) Which farmer’s top hen would you want on your farm? Explain why!

9) McDonalds claim they cater to a healthier target audience than their rivals.
Comment on this claim with reference to your answers to questions 5 and 6
Which measures to use?
Usually, when analysing average and spread you choose either:

Mean and standard deviation, or Median and IQR

What kind of distributions suit these two options?

If a distribution is symmetrical (often when there is a lot of data),
mean and standard deviation are used as ideally, all data is
reflected in the measures used.

If a distribution is skewed, median and IQR are used as they are
unaffected by any outliers/extreme values
Relative dispersion
A survey is done of test marks for 2 classes. Here is a summary of statistics:

Class A Class B
Mean 68% 73%
Standard Deviation 5% 6%

Which class’s results are more spread out?

Class B have a larger standard deviation, but a higher mean too. Is it a fair comparison?

Calculating μ / σ will enable you to ‘fairly’ compare the dispersion of 2 sets of data.

Class A: μ / σ = 68 / 5 = 13.6 Class B: μ / σ = 73 / 6 = 12.2

By dividing by σ, you ‘scale down’ the numbers to make a fair comparison

So the mean is about 14 times the standard deviation for class A – whereas in class B
the mean is about 12 times the standard deviation.
Class B’s results are more spread out, relative to the mean
Histograms
Eg Some data on height

Frequency density = Frequency ÷ Class width

Height (h) in cm Frequency Frequency density
130 ≤ h < 150 42 42 ÷ 20 = 2.1
150 ≤ h < 160 35 35 ÷ 10 = 3.5
160 ≤ h < 165 16 16 ÷ 5 = 3.2
165 ≤ h < 180 39 39 ÷ 15 = 2.6
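
The frequency density calculation, and its reverse (frequency = frequency density × class width), can be sketched in Python as below. The data layout and function names are illustrative.

```python
def frequency_density(lower, upper, frequency):
    """Frequency density = frequency / class width."""
    return frequency / (upper - lower)

def frequency_from_density(lower, upper, density):
    """Reverse the rule: frequency = frequency density x class width."""
    return density * (upper - lower)

# Height classes from the table above: (lower, upper, frequency)
heights = [(130, 150, 42), (150, 160, 35), (160, 165, 16), (165, 180, 39)]
densities = [round(frequency_density(*row), 1) for row in heights]

print(densities)                            # [2.1, 3.5, 3.2, 2.6]
print(frequency_from_density(10, 20, 5.6))  # 56.0
```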
Finding frequencies from histograms
Eg The histogram gives information about the books sold in a bookshop
one Saturday. Use the histogram to complete the table.

Price (P) in pounds (£) Frequency
0 < P ≤ 5 8 × 5 = 40
5 < P ≤ 10 12 × 5 = 60
10 < P ≤ 20 5.6 × 10 = 56
20 < P ≤ 40 1.6 × 20 = 32

Frequency = frequency density × class width


S1 histograms
In the GCSE, you learnt that Area = Frequency
In fact, Area ∝ Frequency, meaning Area = k × Frequency for some constant k
Some questions require you to find k so you can determine frequencies.

Eg The 60 employees of a company are surveyed on
their salaries and a histogram is made of this data (horizontal axis: Salary £1000s):

Calculate the number of employees
that earn less than £40000

Total area = 80 + 60 + 48 + 52 = 240
Area = k × Frequency, so 240 = k × 60 and k = 4
Frequency = Area ÷ 4
80 ÷ 4 = 20

Now try Ex4E, p64, Q2-5


You may not need to work out the value of k, but use the logical facts that:
Bar width ∝ class width Bar height ∝ frequency density

Eg the following data is collected on the times of some
pupils to complete a question, to the nearest minute:

Time (minutes) 3-5 6-8 9-14 15-20
Frequency 8 15 9 4

When a histogram is constructed for this data, the 6-8 minutes bar has
width 2cm and height 3cm. Find the dimensions of the 9-14 minutes bar.

Identify the class widths & frequency densities of the bars concerned

6-8: class width = 8.5 − 5.5 = 3, frequency density = 15 / 3 = 5
9-14: class width = 14.5 − 8.5 = 6, frequency density = 9 / 6 = 1.5

Width: 3 units = 2cm, so 6 units = 4cm
Height: 5 units = 3cm, so 1.5 units = 0.9cm (× 0.3)

The 9-14 minutes bar has width 4cm and height 0.9cm

Now try Ex4G, p75, Q6
Eg data on the salaries of a company’s employees

Salary P in £1000s Frequency Frequency density
20 < P ≤ 40 20 1
40 < P ≤ 50 15 1.5
50 < P ≤ 55 12 2.4
55 < P ≤ 65 13 1.3

[Figures: the data drawn as a bar graph and as a histogram]

Why bother with a histogram? Which gives a better insight about the data?
The histogram, as it considers height in relation to width

Statistical Calculations with grouped data
1. Mr Walker is analysing the January exam results in Maths:

4 0 5 Key: 5 │ 4 means 54%


5 1 4 9
6 2 5 7 8
7 2 3 6 6
8 0 4 7 7 7
9 2 3 8 9

Mode = Mean μ =

Lower quartile Q1 =
Standard deviation σ =
Median Q2 =

Upper quartile Q3 = Q3  2Q2  Q1


Skew using
Q3  Q1
IQR =
2. Mr Walker is also analysing the January exam results in Science:

4 3 8 9 You may use that  x  1458


5 1 1 5 8

2
and fx  100608
6 2 3 3 3 5 9
7 0 3 3 9
8 1 1 2 4
9 5

Mode = Mean μ =

Lower quartile Q1 =
Standard deviation σ =
Median Q2 =

Upper quartile Q3 = Q3  2Q2  Q1


Skew using
Q3  Q1
IQR =
3. Farmer Smith wants to know which of his hens is the most prolific egg-layer.
He records the number of eggs laid each day over a 3-week period by his best hen:

Number of eggs    Frequency
1                 3
2                 6
3                 5
4                 1
5                 6

Mode =
Mean μ =
Standard deviation σ =

Lower quartile Q1 =
Median Q2 =
Upper quartile Q3 =
IQR =

Skew using (mean − median) / standard deviation =
4. Farmer Jones also wants to know which of his hens is the most prolific egg-layer.
He records the number of eggs laid each day over a 4-week period by his best hen:

Number of eggs    Frequency
1                 2
2                 5
3                 10
4                 8
5                 3

You may use that Σfx = 89 and Σfx² = 315

Mode =
Mean μ =
Standard deviation σ =

Lower quartile Q1 =
Median Q2 =
Upper quartile Q3 =
IQR =

Skew by comparing mean, median & mode
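
The summary values quoted for question 4 come straight from the frequency table. As a hedged sketch (the function name and the (x, f) tuple layout are illustrative choices), they can be reproduced like this:

```python
def frequency_sums(table):
    """Return (n, Σfx, Σfx²) for a list of (value, frequency) pairs."""
    n = sum(f for _, f in table)
    sum_fx = sum(f * x for x, f in table)
    sum_fx2 = sum(f * x ** 2 for x, f in table)
    return n, sum_fx, sum_fx2


# Farmer Jones's hen: (number of eggs, frequency)
eggs = [(1, 2), (2, 5), (3, 10), (4, 8), (5, 3)]

n, sum_fx, sum_fx2 = frequency_sums(eggs)
print(n, sum_fx, sum_fx2)  # 28 89 315
```

Here n = 28 matches the 4-week recording period, which is a useful check that the table has been read correctly.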
5. The owners of KFC want to analyse the weights of their target customers.
They conduct a survey:

Weight (w kg)    Frequency
0 ≤ w < 30       13
30 ≤ w < 40      24
40 ≤ w < 70      17
70 ≤ w < 80      12
80 ≤ w < 100     5

You may use that Σfx = 3320 and Σfx² = 191750

Mean μ =
Standard deviation σ =

Lower quartile Q1 =
Median Q2 =
Upper quartile Q3 =
IQR =

Skew using (Q3 − 2Q2 + Q1) / (Q3 − Q1) =
6. The owners of McDonald’s want to analyse the weights of their target customers.
They conduct a survey:
Weight (kg) Frequency
0-20 24
21-40 31
41-60 47
61-80 75
81-100 23

Mean μ =
Standard deviation σ =

Lower quartile Q1 =
Median Q2 =
Upper quartile Q3 =
IQR =

Skew using 3(mean − median) / standard deviation =
7a) Compare the Maths exam marks with the Science exam marks

b) Mr Brown wants to give a bonus to a department on the basis of the exam results.
Use your answer to (a) to advise him.

8) Which farmer’s top hen would you want on your farm? Explain why!

9) McDonald’s claim they cater to a healthier target audience than their rivals.
Comment on this claim with reference to your answers to questions 5 and 6.
WB1 Over a period of time, the number of people x leaving a hotel each morning
was recorded. These data are summarised in the stem and leaf diagram below.

Number leaving          Totals
2 7 9 9                 (3)
3 2 2 3 5 6             (5)
4 0 1 4 8 9             (5)
5 2 3 3 6 6 6 8         (7)
6 0 1 4 5               (4)
7 2 3                   (2)
8 1                     (1)

For these data,
(a) write down the mode,
(b) find the values of the three quartiles.

a) Mode = 56

b) n = 27
Lower quartile: n/4 = 6.75, so use the 7th value = 35
Median: n/2 = 13.5, so use the 14th value = 52
Upper quartile: 3n/4 = 20.25, so use the 21st value = 60
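
The quartile positions in WB1 can be checked with a short Python sketch. It assumes the usual S1 convention for listed data: when the position is not a whole number, round up; when it is a whole number, take the midpoint of that value and the next. Only the round-up case occurs in this example, so the whole-number branch is an assumption rather than something shown above.

```python
import math
import statistics


def value_at(sorted_data, position):
    """Return the data value at a 1-based quartile position."""
    if position == int(position):
        i = int(position)
        # Whole-number position: midpoint of this value and the next one.
        return (sorted_data[i - 1] + sorted_data[i]) / 2
    # Otherwise round the position up to the next whole number.
    return sorted_data[math.ceil(position) - 1]


def quartiles(data):
    s = sorted(data)
    n = len(s)
    return tuple(value_at(s, k * n / 4) for k in (1, 2, 3))


stem_and_leaf = {
    2: [7, 9, 9],
    3: [2, 2, 3, 5, 6],
    4: [0, 1, 4, 8, 9],
    5: [2, 3, 3, 6, 6, 6, 8],
    6: [0, 1, 4, 5],
    7: [2, 3],
    8: [1],
}
data = [10 * stem + leaf
        for stem, leaves in stem_and_leaf.items()
        for leaf in leaves]

print(statistics.mode(data))  # 56
print(quartiles(data))        # (35, 52, 60)
```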
Given that Σx = 1335 and Σx² = 71801 find
(c) the mean and the standard deviation of these data.
One measure of skewness is found using (mean – mode) / standard deviation.
(d) Evaluate this measure to show that these data are negatively skewed.
(e) Give two other reasons why these data are negatively skewed

c) μ = Σx / n = 1335 / 27 = 49.444...
   Mean = 49.4 (1dp)

   σ² = Σx² / n − (Σx / n)² = 71801 / 27 − (1335 / 27)² = 17378 / 81
   σ = √(17378 / 81) = 14.647...
   Standard deviation = 14.6 (1dp)

d) (49.444... − 56) / 14.647... = −0.4475... < 0 indicating negative skew

e) For negative skew:
   Mode > Median > Mean        56 > 52 > 49.4
   Q2 − Q1 > Q3 − Q2           52 − 35 > 60 − 52
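
Parts (c) and (d) of WB1 use only the summary values n, Σx and Σx². A minimal Python sketch of the same calculation, with an illustrative function name:

```python
import math


def mean_and_sd(n, sum_x, sum_x2):
    """Mean and standard deviation from n, Σx and Σx²."""
    mean = sum_x / n
    variance = sum_x2 / n - mean ** 2  # Σx²/n − (Σx/n)²
    return mean, math.sqrt(variance)


mean, sd = mean_and_sd(27, 1335, 71801)
mode = 56

# Skewness measure from part (d): (mean − mode) / standard deviation
skew = (mean - mode) / sd

print(round(mean, 1), round(sd, 1))  # 49.4 14.6
print(skew < 0)                      # True
```

Carrying the unrounded mean and standard deviation into the skewness calculation, as the working above does, avoids rounding error in the final answer.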
WB2 The following table summarises the distances, to the nearest km, that 134
examiners travelled to attend a meeting in London.

Distance (km)    Number of examiners
41–45            4
46–50            19
51–60            53
61–70            37
71–90            15
91–150           6

(a) Give a reason to justify the use of a histogram to represent these data.
(b) Calculate the frequency densities needed to draw a histogram for these data.
(DO NOT DRAW THE HISTOGRAM)

a) Data is continuous and class widths vary

b) Frequency density = Frequency ÷ Class width

Effective class boundaries    Class width    Number of examiners    Frequency density
40.5 – 45.5                   5              4                      4 ÷ 5 = 0.8
45.5 – 50.5                   5              19                     3.8
50.5 – 60.5                   10             53                     5.3
60.5 – 70.5                   10             37                     3.7
70.5 – 90.5                   20             15                     0.75
90.5 – 150.5                  60             6                      0.1
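
Because the distances are recorded to the nearest km, each class has to be widened by 0.5 at both ends before the class width is found. A short Python sketch of part (b), using a hypothetical (low, high, frequency) tuple for each class as printed in the question:

```python
def frequency_density_table(classes):
    """Rows of (lower boundary, upper boundary, class width, density)
    for data recorded to the nearest whole unit."""
    rows = []
    for low, high, freq in classes:
        lower, upper = low - 0.5, high + 0.5  # effective class boundaries
        width = upper - lower
        rows.append((lower, upper, width, freq / width))
    return rows


distance_classes = [(41, 45, 4), (46, 50, 19), (51, 60, 53),
                    (61, 70, 37), (71, 90, 15), (91, 150, 6)]

for lower, upper, width, density in frequency_density_table(distance_classes):
    print(lower, upper, width, round(density, 2))
```

Using the printed limits directly (for example 45 − 41 = 4) would give the wrong class widths, which is the most common slip in this type of question.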
(c) Use interpolation to estimate the median Q2, the lower quartile Q1, and the
upper quartile Q3

The mid-point of each class is represented by x and the corresponding frequency
by f. Calculations then give the following values
Σfx = 8379.5 and Σfx² = 557489.75
(d) Calculate an estimate of the mean and an estimate of the standard deviation
for these data.

Distance (km)    Number of examiners    Running total
40.5–45.5        4                      4
45.5–50.5        19                     23
50.5–60.5        53                     76
60.5–70.5        37                     113
70.5–90.5        15                     128
90.5–150.5       6                      134

c) Σf = 134
Median Q2 = 67th value = 50.5 + (67 − 23)/53 × 10 = 6233/106 = 58.80
Lower quartile Q1 = 33.5th value = 50.5 + (33.5 − 23)/53 × 10 = 5563/106 = 52.48
Upper quartile Q3 = 100.5th value = 60.5 + (100.5 − 76)/37 × 10 = 4967/74 = 67.12

d) Mean μ = 8379.5 / 134 = 62.53
Standard deviation σ = √(557489.75 / 134 − (8379.5 / 134)²) = 15.81
WB2 One coefficient of skewness is given by (Q3 − 2Q2 + Q1) / (Q3 − Q1)
(e) Evaluate this coefficient and comment on the skewness of these data.
(f) Give another justification of your comment in part (e).

e) Q1 = 5563/106, Q2 = 6233/106, Q3 = 4967/74
   (Q3 − 2Q2 + Q1) / (Q3 − Q1) = 0.14 > 0, indicating positive skew

f) For positive skew: Median < Mean
   58.80 < 62.53

   or Q2 − Q1 < Q3 − Q2
   58.80 − 52.48 < 67.12 − 58.80
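
The interpolation for the WB2 quartiles and the quartile skewness coefficient can be sketched in Python as follows. The function name and the (lower boundary, upper boundary, frequency) layout are illustrative assumptions; positions are taken as n/4, n/2 and 3n/4 with no rounding, as is usual for grouped data.

```python
def interpolate(classes, position):
    """Estimate the value at a given position by linear interpolation.

    classes: list of (lower boundary, upper boundary, frequency).
    """
    cumulative = 0
    for lower, upper, freq in classes:
        if cumulative + freq >= position:
            # Fraction of the way through this class, scaled by its width.
            return lower + (position - cumulative) / freq * (upper - lower)
        cumulative += freq
    raise ValueError("position exceeds total frequency")


classes = [(40.5, 45.5, 4), (45.5, 50.5, 19), (50.5, 60.5, 53),
           (60.5, 70.5, 37), (70.5, 90.5, 15), (90.5, 150.5, 6)]
n = sum(freq for _, _, freq in classes)

q1, q2, q3 = (interpolate(classes, k * n / 4) for k in (1, 2, 3))
skew = (q3 - 2 * q2 + q1) / (q3 - q1)

print(round(q1, 2), round(q2, 2), round(q3, 2))  # 52.48 58.8 67.12
print(round(skew, 2))                            # 0.14
```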
WB3 Aeroplanes fly from City A to City B. Over a long period of time the number of
minutes delay in take-off from City A was recorded. The minimum delay was 5
minutes and the maximum delay was 63 minutes. A quarter of all delays were at
most 12 minutes, half were at most 17 minutes and 75% were at most 28 minutes.
Only one of the delays was longer than 45 minutes.
An outlier is an observation that falls either 1.5 x interquartile range above the
upper quartile or 1.5 x interquartile range below the lower quartile.
(a) On graph paper, draw a box plot to represent these data.

Min = 5   Q1 = 12   Q2 = 17   Q3 = 28   Max = 63
IQR = 16   Q3 + 1.5 × IQR = 52   Q1 − 1.5 × IQR = −12
63 is an outlier; the next biggest value is 45
(b) Comment on the distribution of delays. Justify your answer.
(c) Suggest how the distribution might be interpreted by a passenger who
frequently flies from City A to City B.


b) Positively skewed, as Q2 − Q1 < Q3 − Q2 (17 − 12 = 5 and 28 − 17 = 11)


c) Positive skew indicates that delays are more often short, so they are not such a problem
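
The outlier limits used for the WB3 box plot can be checked with a small Python sketch. The function name is illustrative; the 1.5 × IQR rule is the one stated in the question.

```python
def outlier_limits(q1, q3, multiplier=1.5):
    """Lower and upper limits beyond which a value counts as an outlier."""
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr


lower_limit, upper_limit = outlier_limits(12, 28)
print(lower_limit, upper_limit)  # -12.0 52.0

# The maximum delay of 63 lies above the upper limit, so it is an outlier.
# The minimum delay of 5 lies above the lower limit, so it is not.
print(63 > upper_limit, 5 < lower_limit)  # True False
```

On the box plot the upper whisker therefore stops at 45, the largest delay that is not an outlier, and 63 is marked separately.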
