CH 4 Scatter Diagrams and Correlation


Edexcel GCSE (9 – 1)

Statistics
Mr M Dominguez
mdominguez@kegs.org.uk
Chapter 4 Scatter Diagrams and
Correlation
Lesson 1: 4.1 to 4.5 Print out Q1 worksheet

Lesson 2: 4.6 to 4.8


Lesson 3: 4.8 to 4.9
§ 4.1 Scatter Diagrams
The most important graphical summary of bivariate data is the scatter
diagram. This is simply a plot of the points (Xi, Yi) in the plane. The following
figures show scatter diagrams of June maximum temperatures against January
maximum temperatures, and of January maximum temperatures against
latitude.
A key feature in a scatter diagram is the correlation, or trend, between X and
Y. Higher January temperatures tend to be paired with higher June
temperatures, so these two values have a positive correlation. Higher
latitudes tend to be paired with lower January temperatures, so
these values have a negative correlation. If higher X values are paired with
low or with high Y values equally often, there is no correlation.
For a scatter diagram we plot the explanatory (independent)
variable on the x-axis and the response (dependent) variable on
the y-axis.

Sometimes we can struggle to identify which variable is which.

The most obvious way to identify variables is to look at which
comes first in the table of values.

Scatter diagrams are used to represent bivariate data.

A common misconception is that the data must be continuous.

You do not need to start your axes from 0. Graphs should
contain a suitable scale.
1) The table below shows the shoe size and mass of 8 men.
(a) Plot a scatter graph for this data and draw a line of best fit.
(b) Why is a scatter diagram suitable for this data?

Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80

[Figure: blank axes, Shoe Size (4 to 13) against Mass (kg) (60 to 100)]
§ 4.2 Correlation
There are many different types of correlation, not just positive or negative.
Scatter graphs are used to show whether there is a relationship between two sets
of data. The relationship between the data can be described as either:
1. A positive correlation. As one quantity increases so does the other.
2. A negative correlation. As one quantity increases the other decreases.
3. No linear correlation. Both quantities vary with no clear relationship.
[Figure: three scatter diagrams. Height against Shoe Size: positive correlation. Soup Sales against Temperature: negative correlation. Shoe Size against Annual Income: no correlation.]
A positive or negative correlation is characterised by a straight line with a
positive or negative gradient. The strength of the correlation depends on
the spread of points around the imagined line.

[Figure: six scatter diagrams showing strong, moderate and weak positive correlation, and strong, moderate and weak negative correlation]


Describing / interpreting correlation in context

Two types of questions

• What correlation does the scatter diagram suggest? Or:
• Describe the correlation between height and weight.

• Interpret, in context, the type of correlation. Or:
• What conclusions can you draw about the correlation between
height and weight?
The scatter diagram shows the heights and weights of different students.

• Describe: (strong) positive correlation.
• Interpret, in context: as height increases, the weight increases.
1) The table below shows the shoe size and mass of 8 men.
(a) Plot a scatter graph for this data and draw a line of best fit.

Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80

(c) What is the correlation between shoe size and mass?
Positive correlation.
(d) Describe / interpret the correlation in context.
As shoe size increases, mass increases. (Shoe size must come first. Why?)

[Figure: axes, Shoe Size (4 to 13) against Mass (kg) (60 to 100)]
§ 4.3 Causal Relationships
Do not draw causal implications from statements about associations, unless
your data come from a randomized experiment. Just because January and
June temperatures increase together does not mean that January
temperatures cause June temperatures to increase (or vice versa). The only
certain way to sort out causality is to move beyond statistical analysis and talk
about mechanisms.

In general, if X and Y have an association, then

(i) X could cause Y to change (a causal relationship)
(ii) Y could cause X to change (a causal relationship)
(iii) a third unmeasured (perhaps unknown) variable Z could
cause both X and Y to change.

Unless your data come from a randomized experiment, statistical analysis
alone is not capable of answering questions about causality.
Page 215 Q 1,2, and 6
For the association between January and July temperatures, we can try to
propose some simple mechanisms, one for each of cases (i) to (iii):
i. Warmer or cooler air masses in January persist in the atmosphere until
July, causing similar effects on the July temperature.
ii. None: it is impossible for one event to cause another event that
preceded it in time.
iii. If Z is latitude, then latitude influences temperature because it
determines the amount of atmosphere that solar energy must traverse to
reach a particular point on the Earth’s surface.
§ 4.4 Line of best fit
The line of best fit must:
• Pass through the mean point (mean of data 1, mean of data 2).
• Have roughly the same number of points above and below the line.

1) The table below shows the shoe size and mass of 10 men.
(e) Find the mean shoe size and the mean mass

Size 5 12 7 10 10 9 8 11 6 8
Mass 65 97 68 92 78 78 76 88 74 80
1) The table below shows the shoe size and mass of 8 men.

Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80

(f) Draw a line of best fit.
The mean point should be plotted on your graph (a cross with a circle round it). The line of
best fit must always pass through this point:
(mean data 1, mean data 2)
In this case: (8.5, 78.625)

[Figure: axes, Shoe Size (4 to 13) against Mass (kg) (60 to 100)]
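The mean point quoted above can be checked with a few lines of Python. This is only a sketch: the function name `mean_point` is made up for illustration, and the lists are copied from the table of 8 men.

```python
# Shoe size and mass (kg) of the 8 men, copied from the table above.
sizes = [5, 12, 7, 10, 9, 11, 6, 8]
masses = [65, 97, 68, 78, 79, 88, 74, 80]

def mean_point(xs, ys):
    """Return (mean of xs, mean of ys), the point the line of best fit should pass through."""
    return sum(xs) / len(xs), sum(ys) / len(ys)

mean_size, mean_mass = mean_point(sizes, masses)
print(mean_size, mean_mass)  # 8.5 78.625
```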
2) The table below shows the number of people who visited a museum over a 10 day
period last summer together with the daily sunshine totals.
(a) Plot a scatter graph for this data and draw a line of best fit.
(b) Draw a line of best fit and comment on the correlation.

Hours Sunshine 6 0.5 8.5 3 8 10 7 5 3 2
Visitors 300 475 100 390 200 50 175 220 350 320

If you have a calculator you can find the mean of each set of
data and plot this point to help you draw the line of best fit.
Ideally all lines of best fit should pass through co-ordinates:
(mean data 1, mean data 2)
In this case: (5.3, 258)

[Figure: blank axes, Hours of Sunshine (0 to 10) against Number of Visitors (100 to 500)]
§ 4.5 Interpolation and extrapolation
Using our line of best fit we can estimate the
value of one variable when given the other.

If the value we are estimating is within our range
of values we call it interpolation.

If the value we are estimating is outside our
range of values we call it extrapolation.

Interpolation estimates are generally more
accurate than extrapolation estimates.
Furthermore, the further you extrapolate, the more
inaccurate your estimate is likely to be.
For GCSE maths you may describe an estimate as being inaccurate
because it is outside the collected range of values.
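The distinction comes down to a simple range check. A minimal sketch, using the shoe size and mass data of the 8 men; the function name `estimate_type` is a made-up helper for illustration only.

```python
def estimate_type(value, observed):
    """Classify an estimate by whether `value` lies inside the observed data range."""
    if min(observed) <= value <= max(observed):
        return "interpolation"
    return "extrapolation"

# Shoe size and mass (kg) of the 8 men.
sizes = [5, 12, 7, 10, 9, 11, 6, 8]
masses = [65, 97, 68, 78, 79, 88, 74, 80]

print(estimate_type(10.5, sizes))  # interpolation: sizes run from 5 to 12
print(estimate_type(62, masses))   # extrapolation: masses run from 65 to 97
```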
1) The table below shows the shoe size and mass of 8 men.

Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80

(g) Use your line of best fit to estimate:
(i) The mass of a man with shoe size 10½.
87 kg
(ii) The shoe size of a man with a mass of 62 kg.
Size 4.2
(iii) Which estimation will be more accurate and why?
Part (ii) is less accurate as it is extrapolation, or
part (i) is more accurate as it is interpolation.

[Figure: axes, Shoe Size (4 to 13) against Mass (kg) (60 to 100)]
2) The table below shows the number of people who visited a museum over a 10 day
period last summer together with the daily sunshine totals.

Hours Sunshine 6 0.5 8.5 3 8 10 7 5 3 2
Visitors 300 475 100 390 200 50 175 220 350 320

Use your line of best fit to estimate:
(i) The number of visitors for 4 hours of sunshine.
310
(ii) The hours of sunshine when 250 people visit.

[Figure: axes, Hours of Sunshine (0 to 10) against Number of Visitors (100 to 500)]
§ 4.6 The equation of a line of best fit
To find the equation of the line of best fit you must find
the gradient. You must also know a point on the line:
either the y-intercept or the mean point. You can then use one
of the two general equations for a straight line.

Using the line you can estimate values, but most
importantly you must be able to describe the
significance of m and c in the equation within the
context of the question.

It is not enough to describe m as the gradient and c as the
y-intercept. The descriptions must be in context.
How can we come up with an equation that could estimate a Maths
score (y) from an English score (x)?

[Figure: Maths vs English Test Scores, scatter diagram with a line of best fit, English Score (0 to 100) against Maths Score (0 to 100)]

We can find the gradient by picking two points on the line suitably far apart:
(0, 39) and (80, 82)
Change in y is 43
Change in x is 80
m = Δy/Δx = 43/80 = 0.54

The y-intercept seems to be about 39.

y = 0.54x + 39

Interpret the values of m and c in the equation of the line
For every extra mark in English, the maths mark increases by 0.54.
The maths mark is approximately 39 when a student scores 0 in the English test.
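The same working can be sketched in Python. The function name `line_through` is invented for this example; the two points are the ones read off the line above.

```python
def line_through(p1, p2):
    """Return (m, c) for the straight line y = m*x + c through two points."""
    (x1, y1), (x2, y2) = p1, p2
    m = (y2 - y1) / (x2 - x1)  # gradient = change in y / change in x
    c = y1 - m * x1            # rearranged from y1 = m*x1 + c
    return m, c

m, c = line_through((0, 39), (80, 82))
print(round(m, 2), c)  # 0.54 39.0
```

Picking points far apart matters because small misreadings of the graph then have less effect on the gradient.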
We can actually use our calculator to input data and find a line of best fit.

Distance from Kingston (x) 0.2km 2.5km 3.6km 0.8km
House Price (y) £560,000 £470,000 £365,000 £580,000
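A calculator's regression mode usually finds the least-squares line, so the sketch below assumes that method; it is not necessarily what any particular calculator does internally. The function name `least_squares_line` is made up for illustration, and the data are the four houses in the table.

```python
# Distance from Kingston (km) and house price (pounds), copied from the table above.
distance_km = [0.2, 2.5, 3.6, 0.8]
price = [560_000, 470_000, 365_000, 580_000]

def least_squares_line(xs, ys):
    """Return (m, c) of the least-squares regression line y = m*x + c."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    m = s_xy / s_xx
    c = mean_y - m * mean_x  # the line passes through the mean point
    return m, c

m, c = least_squares_line(distance_km, price)
print(f"price = {m:.0f} * distance + {c:.0f}")
```

For this data the gradient comes out at roughly -60,000, which in context means the price falls by about £60,000 for each extra kilometre from Kingston.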
1) The table below shows the shoe size and mass of 8 men.

Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80

(h) Calculate the gradient of the line of best fit.

(i) Find the equation of the line of best fit.
This time we can’t find the y-intercept from the graph.
To find c, substitute in a known point, (8.5, 78.625).

[Figure: axes, Shoe Size (4 to 13) against Mass (kg) (60 to 100)]
1) The table below shows the shoe size and mass of 8 men.

Size 5 12 7 10 9 11 6 8
Mass 65 97 68 78 79 88 74 80

Interpret the values of the gradient and the y-intercept.
Gradient:
As shoe size increases by 1, the mass increases by 4.3 kg.
y-intercept:
A man with a shoe size of 0 has an estimated mass of 42.2 kg.
(Why is this value not very accurate? Does this make any sense?)

[Figure: axes, Shoe Size (4 to 13) against Mass (kg) (60 to 100)]
[Figure: scatter diagram of Age (0 to 90) against Weekly time on internet (hours, 0 to 25), with line of best fit y = -0.18x + 17]

If someone’s age is 50, how many hours would we therefore expect
them to be on the internet?
(-0.18 × 50) + 17 = 8
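Substituting into the equation of the line is all an estimate involves. A short sketch, where `predict_hours` is a made-up name and the gradient and intercept are taken from the line on the graph:

```python
def predict_hours(age, m=-0.18, c=17):
    """Estimate weekly hours on the internet from age using y = m*x + c."""
    return m * age + c

print(round(predict_hours(50), 1))  # 8.0
```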
In general, we should be wary of making estimates using values outside the
range of our data.

[Figure: scatter diagram of Age (0 to 90) against Earnings (£0 to £80000), with two ages marked outside the range of the data]

Estimating for the older marked age is bad because:
The person may have retired.

Estimating for the younger marked age is bad because:
Children don’t have full-time jobs.

When we use our line of best fit to estimate a value inside the range of our
data, this is known as: interpolation

When we use our line of best fit to estimate a value outside the range of
our data, this is known as: extrapolation
Key Question
The scatter diagram shows information about 10 apartments in a city.
The graph shows the distance from the city centre and the monthly rent
of each apartment.

a) Draw a line of best fit. (2)

b) Describe and interpret the correlation shown in the scatter diagram. (2)
Description: (Strong) negative correlation.
Interpretation: As the distance from the city centre increases, the
monthly rent decreases.
(The independent variable (x axis) should always come first.)
c) Calculate the gradient of the line of best fit. (2)
Using the points (1.3, 400) and (3.8, 100) on the line:
gradient = Δy/Δx = (100 − 400)/(3.8 − 1.3) = −120

d) Write the equation of the line of best fit in the form y = ax + b. (2)
y = −120x + 560 (560 is the y-intercept)

e) Interpret the value of a and b. (2)
a: As the distance from the city centre increases by 1 km, the cost of monthly rent
decreases by £120.
(The independent variable always comes first. Make sure to include units.)
b: Apartments in the centre of the city have a monthly rent of £560.
Or: when the distance from the city centre is 0, the monthly rent of
an apartment is £560.
f) An apartment which is 5 km from the city centre has a monthly rent of
£100. Explain why using the line of best fit to predict the monthly rent
may not be reliable. (1)
Extrapolation: 5 km is not in the data range.

g) Why is a scatter diagram a suitable diagram to represent this data? (1)
Bivariate data
The scatter diagram shows information for some weather stations. It shows the height of
each weather station above sea level (m) and the mean July midday temperature (°C) for
that weather station.
Find the equation of the line of best fit, given that the mean point is (14, 1450).
What does the gradient mean in the context of the question?
What does the y-intercept mean in the context of the question?
Is it sensible to extend the graph to x = 0?
Use your equation to predict the height of a station which records a temperature of 15 °C.
§ 4.7 Spearman’s rank correlation coefficient

Shows “agreement” as opposed to correlation.

close to +1: more agreement between ranks
close to -1: more disagreement between ranks
close to zero: the ranks neither agree nor disagree.

It is useful when comparing two different relationships or sets of data.
E.g. is there more agreement between height and weight, or between height and arm length?
§ 4.8 Calculating Spearman’s rank correlation coefficient
Does being good at maths make you better at biology?

Student     Maths exam score   Biology exam score
Anand       57                 83
Bernard     45                 37
Charlotte   72                 41
Demi        78                 86
Eustace     53                 56
Ferdinand   63                 85
Gemma       86                 77
Hector      98                 87
Ivor        59                 70
Jasmine     71                 59

Is there a statistically significant correlation between these two sets of results?


Step 1: Rank each set of data (lowest to highest)

Student     Maths exam score   Maths rank   Biology exam score   Biology rank
Anand       57                 3            83                   7
Bernard     45                 1            37                   1
Charlotte   72                 7            41                   2
Demi        78                 8            86                   9
Eustace     53                 2            56                   3
Ferdinand   63                 5            85                   8
Gemma       86                 9            77                   6
Hector      98                 10           87                   10
Ivor        59                 4            70                   5
Jasmine     71                 6            59                   4
Step 2: Work out the differences in ranks (maths – biology)

Student     Maths exam score   Maths rank   Biology exam score   Biology rank   d   d²
Anand       57                 3            83                   7              4   16
Bernard     45                 1            37                   1              0   0
Charlotte   72                 7            41                   2              5   25
Demi        78                 8            86                   9              1   1
Eustace     53                 2            56                   3              1   1
Ferdinand   63                 5            85                   8              3   9
Gemma       86                 9            77                   6              3   9
Hector      98                 10           87                   10             0   0
Ivor        59                 4            70                   5              1   1
Jasmine     71                 6            59                   4              2   4
                                                                            ∑d² = 66
Step 3: Work out the square of the differences
Step 4: Work out the sum of the square of the differences
Step 5: Work out the value of the coefficient, rs

n = 10
∑d² = 66

rs = 1 − 6(66) / (10(10² − 1)) = 1 − 6(66) / (10 × 99) = 1 − 0.4 = 0.6
Step 6: Comment on the value of rs
rs = 0.6 suggests moderate agreement (a moderate positive
correlation) between Maths and Biology scores.
As maths scores increase, biology scores also tend to increase.
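Steps 1 to 5 can be sketched in Python as a check on the working. The helper names `ranks` and `spearman` are invented for this example, and the simple ranking used here assumes there are no tied scores, which is true for this data.

```python
# Maths and biology exam scores for the ten students, in table order.
maths = [57, 45, 72, 78, 53, 63, 86, 98, 59, 71]
biology = [83, 37, 41, 86, 56, 85, 77, 87, 70, 59]

def ranks(values):
    """Rank values from lowest (1) to highest (n); assumes no tied values."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 for v in values]

def spearman(xs, ys):
    """Spearman's rank correlation: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(xs)
    sum_d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(ranks(maths))                        # [3, 1, 7, 8, 2, 5, 9, 10, 4, 6]
print(round(spearman(maths, biology), 3))  # 0.6
```

Because each difference is squared, subtracting the ranks in either order gives the same result.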
Step 1: Rank each set of data
Step 2: Work out the differences in ranks (Why doesn’t it matter
what order we subtract in?)
Step 3: Work out the square of the differences
Step 4: Work out the sum of the square of the differences
Step 5: Work out the value of the coefficient, rs
Step 6: comment on the value of rs
1) The table below shows the shoe size and mass of 8 men.

Size      5   12  7   10  9   11  6   8
Mass      65  97  68  78  79  88  74  80
Rank (S)  1   8   3   6   5   7   2   4
Rank (M)  1   8   2   4   5   7   3   6
d         0   0   1   2   0   0   1   2
d²        0   0   1   4   0   0   1   4

Calculate Spearman’s rank correlation coefficient (3 d.p.) and
interpret your answer.
∑d² = 10
rs = 1 − 6(10) / (8(8² − 1)) = 1 − 60/504 = 0.881

The shoe size and mass are in (strong) agreement / positive correlation.
Calculate Spearman's rank correlation
coefficient and interpret your answer.

         Mock % (a)   GCSE % (b)   Rank-a   Rank-b   d   d²
Adnan    78           85           6        5        1   1
Ben      83           93           7        8        1   1
Carl     54           76           2        2        0   0
Dan      77           86           4        6        2   4
Edward   45           78           1        3        2   4
Fred     95           97           9        9        0   0
George   89           91           8        7        1   1
Harry    77           75           4        1        3   9
Ivor     77           84           4        4        0   0
                                                 Total   20

rs = 1 − 6(20) / (9(9² − 1)) = 1 − 120/720 = 0.833

Strong agreement between mock and GCSE % (positive correlation).
The higher a student's mark in the mock, the higher
their mark in the GCSE.
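Dan, Harry and Ivor all scored 77 in the mock, so they share the average of ranks 3, 4 and 5, which is 4. A sketch of that tied-rank rule in Python, where `average_ranks` is a made-up helper name and the same rs formula as the table is applied to the result:

```python
# Mock and GCSE percentages for the nine students, in table order.
mock = [78, 83, 54, 77, 45, 95, 89, 77, 77]
gcse = [85, 93, 76, 86, 78, 97, 91, 75, 84]

def average_ranks(values):
    """Rank lowest to highest; tied values share the mean of the ranks they occupy."""
    ordered = sorted(values)
    result = []
    for v in values:
        first = ordered.index(v) + 1         # lowest rank occupied by v
        last = first + ordered.count(v) - 1  # highest rank occupied by v
        result.append((first + last) / 2)
    return result

rank_a = average_ranks(mock)
rank_b = average_ranks(gcse)
sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
n = len(mock)
rs = 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

print(rank_a)                # [6.0, 7.0, 2.0, 4.0, 1.0, 9.0, 8.0, 4.0, 4.0]
print(sum_d2, round(rs, 3))  # 20.0 0.833
```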
§ 4.9 PMCC
If both variables X and Y are random samples from normal distributions (the data is
symmetrical about the mean and the sample is chosen using a random sampling
method) then the Product Moment Correlation Coefficient (PMCC) can be calculated to
give an estimate of the correlation.

Why use PMCC?

• Gives a value between -1 and 1.
• The closer the PMCC is to -1 or 1, the stronger the correlation.
• A negative value implies negative correlation, and a positive value implies positive correlation.
• A value close to 0 does not imply no correlation. It only shows there is no linear correlation.
• (You do not need to know how to calculate it.)

However, if the variables X and Y are not random samples from a normal distribution, we
cannot use PMCC.

For example, IRG attainment based on a class test would be normally distributed, but
teachers' opinions on effort would not be.
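Although the calculation is not required, a short sketch shows where the value comes from. It uses the standard definition r = Sxy / √(Sxx × Syy) on the shoe size and mass data of the 8 men, purely to show the arithmetic; whether that data meets the normality condition above is not checked here. The function name `pmcc` is made up for illustration.

```python
from math import sqrt

# Shoe size and mass (kg) of the 8 men.
sizes = [5, 12, 7, 10, 9, 11, 6, 8]
masses = [65, 97, 68, 78, 79, 88, 74, 80]

def pmcc(xs, ys):
    """Product moment correlation coefficient: Sxy / sqrt(Sxx * Syy)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)

print(round(pmcc(sizes, masses), 2))  # 0.91
```

A value this close to 1 indicates strong positive linear correlation.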
For each of the questions below identify the most appropriate value for
Spearman's rank correlation coefficient and Pearson's product moment correlation
coefficient, from the list. Then explain your reasoning.
-0.95 -0.60 0.60 0.95

[Figure: three scatter diagrams]

• Spearman's = 0.95, Pearson's = 0.95
The model is linear, hence Spearman’s and Pearson’s would both give strong positive
correlation.

• Spearman's = -0.95, Pearson’s = -0.6
Pearson’s will be closer to 0 as the relationship is non-linear.

• Spearman's = 0, Pearson’s = 0
There is no agreement or linear correlation, hence both values will
be close to 0.
