
Chapter 6

Scatterplots, Association, and Correlation

Chapter 6
Homework
Pg 164 1, 3, 5, 11, 19, 23, 24, 27, 29, 36

Your Turn

Chapter 6

Objectives

Calculate Pearson’s Product Moment Correlation Coefficient


Use TI-84 to find Pearson’s r

Objective: Students will estimate Pearson’s r for a scatterplot and use the TI-84 to find r.

Scatterplots
Scatterplots are often an effective display for data comprising two variables.
In a scatterplot, you can see patterns, relationships between variables, and even any unusual values sitting apart from the overall pattern.

Scatterplots are used to begin to understand any relationship between variables by providing a way to picture possible associations between two quantitative variables.

Scatterplots are especially useful when there is a large number of data points. They provide information about the relationship between two variables.


Scatterplots
As always, we start our conversation with a picture. We describe a scatterplot’s characteristics:
Strength
Shape - linear, curved, etc.
Direction - positive or negative
Presence of outliers

When describing a scatterplot you must describe these four characteristics.


Looking at Scatterplots
When looking at scatterplots, we will look for direction, form (shape), strength, and unusual
features.

When asked to describe a relationship between two variables, we will describe its direction, form (shape), strength, and unusual features.

Do you get the point?


Direction
A pattern in the distribution that runs from the upper left to the
lower right (negative slope) is said to have a negative direction.

A trend in the scatterplot running from lower left to upper right (positive slope) has a positive direction.

What is important is that the behavior of one variable is, in some way, associated with the
behavior of the second variable.

Shape (Form)
To determine the shape of a scatterplot, decide what kind of curve best describes the scatter.

If a straight line best approximates the scatter, the shape is linear.

If a curved line best approximates the scatter, the shape is curvilinear.

There are other models that fit data, and the TI-84 can calculate those models,
but we will restrict ourselves to linear models.

Strength
The line we discussed on the previous slide is critical to determining the strength of the relationship between two variables.
The closer the scatter (the points) comes to actually being on the line describing the shape, the stronger the relationship.

A relationship with most points coming very close to being on a line is a strong relationship.

A relationship with most points far from being on a line is a weak relationship.

No relationship


Redux
A scatterplot gives you an indication of:
Strength - How close do the points come to a line?
Shape - Is that line a straight line (linear relationship)?
Direction - Is the relationship positive (positive slope) or negative (negative slope)?
Outliers - Are any points off by themselves?

Any time you have a scatterplot you must describe those four attributes.


Looking at Scatterplots
As age increases, does the distance at which a highway sign becomes visible decrease?

The figure shows a moderate, negative, linear association between age and sign legibility distance.

Additionally, as age increases, the variability in legibility distance increases.


Looking at Scatterplots
What is the relationship between per capita GDP and life expectancy?

The figure shows a weak, positive, linear association between GDP and life expectancy.

There is noticeably less variability at the upper values of GDP.


Form
How closely do the points approach forming a line? Holistically, how close are the points to being on a line?

If there is a straight line (linear) relationship, it will appear as a cloud of points stretched out in a generally consistent, straight pattern.

The narrower the ellipse, the greater the tendency to linearity, the stronger the relationship, and, thus, the greater the magnitude of r.

It is the tendency to linearity that we want to quantify.


Form
If the relationship isn’t straight, but curves gently while still increasing or decreasing, a linear model would not be appropriate.

We may be able to (and often will) find ways to make a relationship more nearly linear.

But that must wait until later in the course.



Form
Ain’t no fixing this.

With this strong relationship there is no way to linearize the data.

We can find a strong, quadratic regression but that is beyond the scope of this course.


Form
If the relationship curves sharply, the methods of this course cannot provide a single model.

It is possible to fit three distinct linear models to this data. This is most assuredly not a single linear relationship.


Strength
At one extreme of strength, the points appear to follow a single line (whether straight, curved, or bending all over the place).

[Figure panels: Strong and Linear; Strong, just not linear; Strong, really not linear]


Strength
At the other extreme, the points appear as a vague cloud with no discernible trend or pattern:

[Figure: Overall No Relationship]

But look more closely at the different colors/shapes.

Note: we will quantify the amount of scatter soon.



Unusual Features
Look for the unexpected or the odd man out.

Sometimes an unexpected value in a scatterplot of your data indicates something interesting that was unanticipated, perhaps suggesting some follow-up.

An outlier standing away from the overall pattern of the scatterplot may suggest something interesting, or lead you in a direction you had not thought to investigate.

Clusters or subgroups within the data should also raise questions.


Roles for Variables


It is important to determine which of the two quantitative variables goes on the horizontal (x) axis and which on the vertical (y) axis.

This determination is made based on the roles played by the variables. Sometimes the roles are
not so obvious and the variables can play either role. Sometimes the roles make more sense in one
direction. Sometimes the roles are determined by what the researcher is attempting to explain.

The explanatory or predictor variable (also called the independent variable) goes on the
horizontal (x) axis.

The response variable (or dependent variable) goes on the vertical (y) axis.


Roles for Variables


The roles (explanatory or response) that you select for variables
may be arbitrary and could be more about how you perceive the
relationship rather than about the variables themselves.

Does it make more sense to think about hours predicting score or score predicting hours? Could we reverse the roles?

Though we call a variable the predictor (or explanatory) variable, placing that variable on the x-axis
does not necessarily mean that it explains or predicts anything. The variable on the y-axis may
not respond to it in any way. In other words, there may be no association, or the association is the
result of another variable.

You will hear this often: correlation ≠ causation



Correlation
Any measure of strength of linearity should be independent of the units of measurement used for the variables. If you weigh yourself in pounds and measure your height in inches, the relationship between height and weight should not change if you measure height in meters and weight in kilograms.

The correlation coefficient is independent of the units of measure.

So, if units do not matter, let’s remove them by calculating z-scores which have no units.

Correlation

BMI = kg/m²

135 Italian preschool children (66 female) aged 29–68 months at baseline were studied to investigate the best measure of adiposity (fat) change in growing children.

They were recruited in a Legnago (Verona, Italy) kindergarten after excluding those with disorders affecting growth.

Height and weight were measured three times for each child by the same trained observer, at baseline, 4 months, and 9 months.

[Figure: scatterplot of BMI 3 (vertical axis) against BMI 1 (horizontal axis)]

Correlation

[Figure: BMI z-score 3 (vertical axis) plotted against BMI z-score 1 (horizontal axis) in 135 subjects.] Note the pointed shape of the scatterplot.

The points in quadrant 1 (both z-scores positive) and quadrant 3 (both z-scores negative) tend to strengthen the positive relationship between z-score 1 and z-score 3.

The points in quadrant 2 (z-score 1 negative, z-score 3 positive) and quadrant 4 (z-score 1 positive, z-score 3 negative) tend to weaken the positive relationship between z-score 1 and z-score 3. Points on either axis (z-score = 0) have no effect on the relationship.

Correlation
The correlation coefficient (r) gives us a numerical measurement of the strength of the linear relationship between the explanatory and response variables.

$$
r = \frac{\sum_{i=1}^{n} z_{x_i} z_{y_i}}{n - 1}
$$

“How did we get this?” you might ask. Well even if you mightn’t, I am happy to explain.
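Before the explanation, here is a small numerical sketch of the quadrant argument: r is the sum of the products of paired z-scores divided by n − 1, so each point adds a positive product (quadrants 1 and 3) or a negative product (quadrants 2 and 4). The data below are hypothetical values chosen for illustration; they are not the BMI study data.

```python
from statistics import mean, stdev

# Hypothetical paired measurements (NOT the BMI study data).
x = [14.2, 15.1, 15.8, 16.4, 17.0, 17.9]
y = [14.0, 15.4, 15.5, 16.9, 16.6, 18.2]


def z_scores(values):
    """Standardize values using the sample standard deviation (n - 1)."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


zx, zy = z_scores(x), z_scores(y)

# Each point contributes the product of its two z-scores.
for a, b in zip(zx, zy):
    product = a * b
    if product > 0:
        effect = "strengthens a positive relationship (quadrant 1 or 3)"
    elif product < 0:
        effect = "weakens a positive relationship (quadrant 2 or 4)"
    else:
        effect = "no effect (point lies on an axis)"
    print(f"zx={a:+.2f}  zy={b:+.2f}  product={product:+.2f}  {effect}")

# r is the sum of the products divided by n - 1.
r = sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)
print(f"r = {r:.3f}")
```

Most of the products here are positive, which is why r comes out positive for this made-up data set.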


Allow me to reiterate
The correlation coefficient is a numeric measure of the linearity of your data points.

That means the correlation coefficient tells you how close the points come to forming
(or being on) a line.

In any discussion of correlation you must list the attributes mentioned earlier:
Shape, Strength, Direction.

The correlation coefficient (Pearson’s r) is a number indicating the strength of the linear relationship of the variables.


Pearson’s r
Now, for that explanation you were so eagerly awaiting:

Pearson’s correlation coefficient between two variables is defined as the covariance of the two variables divided by the product of their standard deviations.

For a population the formula for finding the correlation coefficient is:

$$
\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}
$$

where E is the expected value operator (mean), and cov means covariance.


Pearson’s r

$$
\rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}
$$

The previous formula defines the population correlation coefficient, represented by the Greek letter ρ (rho). Substituting estimates of the covariance and variances based on a sample gives the sample correlation coefficient, r:

$$
r = \frac{\frac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}
         {\sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}\,\sqrt{\frac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}}
  = \frac{1}{n-1}\sum_{i=1}^{n}\frac{(X_i - \bar{X})(Y_i - \bar{Y})}{s_X\, s_Y}
  = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}
$$

where s_X and s_Y are the sample standard deviations.
Pearson’s r

$$
r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}
$$

To calculate this value we would need to create several lists of data similar to the lists used in finding the standard deviation.

X_i    | Y_i    | X_i − X̄ | Y_i − Ȳ | (X_i − X̄)² | (Y_i − Ȳ)² | (X_i − X̄)(Y_i − Ȳ)
.      | .      | .        | .        | .           | .           | .
.      | .      | .        | .        | .           | .           | .
.      | .      | .        | .        | .           | .           | .
ΣX_i   | ΣY_i   |          |          | SSxx        | SSyy        | SSxy

I really do not want to do all that work and I am certain you are even less enthusiastic.
So we will use the calculator.
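For anyone curious what "all that work" looks like, here is a minimal sketch of the column method in Python. The hours and score values are hypothetical, invented only for illustration. It uses the fact that the (n − 1) factors cancel, so r = SSxy / sqrt(SSxx · SSyy).

```python
from math import sqrt

# Hypothetical data: hours studied and test score (invented for illustration).
hours = [1, 2, 3, 4, 5]
score = [52, 60, 61, 70, 77]

n = len(hours)
x_bar = sum(hours) / n
y_bar = sum(score) / n

# Deviation columns, one entry per observation.
dx = [x - x_bar for x in hours]
dy = [y - y_bar for y in score]

# Column totals from the table.
ss_xx = sum(d * d for d in dx)              # sum of (X_i - X-bar)^2
ss_yy = sum(d * d for d in dy)              # sum of (Y_i - Y-bar)^2
ss_xy = sum(a * b for a, b in zip(dx, dy))  # sum of cross products

# The (n - 1) factors in the covariance and standard deviations cancel.
r = ss_xy / sqrt(ss_xx * ss_yy)
print(f"SSxx={ss_xx}, SSyy={ss_yy}, SSxy={ss_xy}, r={r:.4f}")
```

The same r results from averaging the z-score products with n − 1 in the denominator; this version just skips the intermediate standard deviations.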


Significance
We can always find a correlation coefficient. Given two equal-sized sets of numbers we can create ordered pairs that will give us numbers to put into the formula to find a correlation coefficient.

Length of nose and GPA
Length of hair and IQ

The question then becomes: is that coefficient significant?

There are two kinds of “significant”.

1. Statistically significant
2. Socially significant

Statistically significant suggests the results are due to some effect and not the result of chance.

Socially significant suggests the results have meaning, are important, and/or are large enough to matter.

Significance
Statistical significance of r requires that some assumptions be met.
1. The variables are quantitative random variables.
2. The variables are each unimodal and symmetric.
3. The variables do have a linear relationship.
4. The variables are bivariate normal (i.e. at each value of the normally distributed
independent variable, the dependent variable is normally distributed).


Correlation
For the students’ heights and weights, the correlation is 0.644.
What does this mean in terms of strength? We’ll address this shortly.

To test the significance of r, we must be familiar with hypothesis testing. For now, we will simply state that values of r far from zero (close to –1 or +1) suggest there is some association in the behavior of the variables together.

Such values of r suggest that the behavior of one variable can be predicted from the behavior of the other variable. Notice that I did not claim the behavior of variable 1 is caused by the behavior of variable 2. I simply notice that changes in one variable tend to be matched by changes in the other.


Correlation Conditions
The correlation coefficient is denoted r for a sample and ρ (rho) for the population.

Correlation (most often Pearson’s Product Moment Correlation Coefficient or Pearson’s r) quantifies the strength of the linear association between two quantitative variables.

Before you use correlation, you must check a few conditions:


Quantitative Variables Condition
Sufficiently Linear Condition
Outlier Condition


Correlation Conditions
Quantitative Variables Condition:
Correlation applies only to quantitative variables.
Don’t apply correlation to categorical data masquerading as quantitative.
Check that you know the variables’ units and what they measure.

For ordered (ranked) categories there is another correlation coefficient often used, Spearman’s Rank Correlation Coefficient. That is beyond the scope of this class.
ALCOHOL * SICKDAYS Crosstabulation (Count)

                         | SICKDAYS
ALCOHOL                  | 0 days | 1-6 days | 7+ days | Total
Without Risk             | 347    | 113      | 145     | 605
Hardly any Risk          | 154    | 63       | 56      | 273
Some-Considerable Risk   | 52     | 25       | 34      | 111
Total                    | 553    | 201      | 235     | 989

Correlation Conditions
Sufficiently Linear Condition (Linearity):
You can calculate a correlation coefficient for any pair of variables.
But correlation measures the strength of the linear association only, and will be a misleading summary if the relationship is not truly linear.


Correlation Conditions
Outlier Condition:
Outliers can significantly affect the correlation coefficient.

It is possible, though unlikely, for an outlier to change a positive association into a negative correlation coefficient (and vice versa).

Only considering the blue observations:

When we include the red observations.

As usual when you see an outlier, it’s a good idea to report the correlations
with and without the outliers.
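To see how much one point can matter, here is a short sketch that reports r with and without a single outlier. The "blue" and "red" values are hypothetical stand-ins, not the observations plotted on the slide, and pearson_r is a helper written just for this example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from sums of squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)


# Hypothetical "blue" observations: a clear positive trend.
blue_x = [1, 2, 3, 4, 5, 6]
blue_y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.9]

# One hypothetical "red" outlier far from the pattern.
red_x, red_y = [20], [-10]

r_without = pearson_r(blue_x, blue_y)
r_with = pearson_r(blue_x + red_x, blue_y + red_y)

print(f"r without the outlier: {r_without:.3f}")
print(f"r with the outlier:    {r_with:.3f}")
```

With these made-up numbers the single red point is extreme enough to flip the sign of r, which is exactly why both values should be reported.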

Correlation Properties
The sign of a correlation coefficient gives the direction of the association.

Correlation is always between –1 and +1.

Correlation can equal –1 or +1, but that means all the data points fall exactly on a single straight line, and that is unlikely in the extreme.

A correlation near zero indicates a weak, or no, linear association.


Correlation Properties
Correlation treats x and y symmetrically:
The correlation of x with y is the same as the correlation of y with x.

The correlation coefficient, r, has no units.

Correlation is not affected by changes in the center or scale of either variable.

Correlation can be calculated by using only z-scores, and z-scores are unaffected by
changes in center or scale.
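These properties are easy to check numerically. The sketch below uses hypothetical heights and weights (invented for illustration) and a small helper, pearson_r, written for this example. It compares r in inches and pounds, in meters and kilograms, with the roles of x and y swapped, and after shifting the center of one variable.

```python
from math import isclose, sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from sums of squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)


# Hypothetical student heights (inches) and weights (pounds).
height_in = [62, 64, 66, 68, 70, 72]
weight_lb = [120, 136, 140, 155, 168, 171]

# Change of scale: convert to meters and kilograms.
height_m = [h * 0.0254 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_m, weight_kg)
r_swapped = pearson_r(weight_lb, height_in)                    # x and y exchanged
r_shifted = pearson_r([h - 60 for h in height_in], weight_lb)  # change of center

print(f"inches/pounds:    {r_imperial:.6f}")
print(f"meters/kilograms: {r_metric:.6f}")
print(f"roles swapped:    {r_swapped:.6f}")
print(f"height shifted:   {r_shifted:.6f}")
```

All four values agree up to floating-point rounding, because each version produces the same z-scores.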


Correlation Properties
Correlation measures the strength of the linear association between the two variables. Notice how many times I have mentioned that fact.

Variables may have a strong association but small Pearson’s r because the association is not linear.

Correlation is sensitive to outliers. Outliers have inordinately large effects on Pearson’s r. Outliers tend to have large leverage and can make a weak association look stronger with a large r.

Correlation ≠ Causation
Whenever we have a strong correlation, it is tempting to explain it by claiming that the predictor variable has caused the changes in the response variable.
Correlation is NOT causation. Be very careful how you express the relationship. Do NOT use causal language.

Scatterplots and correlation coefficients never, never, never suggest causation.

A hidden variable that stands behind a relationship and creates the illusion of a relationship by simultaneously affecting the other two variables is called a lurking (or confounding) variable.

e.g. The number of churches in a community is strongly correlated with the amount of criminal activity in that community. (Both tend to increase with the size of the community's population, the lurking variable.)


Correlation Matrices (Tables)


It is common in some fields to compute the correlations between each pair of variables in a collection of variables and arrange these correlations in a table.

I do not like these correlation matrices because they imply an equal significance for each
of the relationships that is most likely not justified.

Watch Out For These Common Mistakes


Don’t say “correlation” when you mean “association.”
More often than not, people say correlation when they mean association.
The word “correlation” should be reserved for measuring the strength and direction of the linear relationship between two quantitative variables.

We often run into that kind of error when two variables are said to be “correlated” and the speaker actually means “dependent”.

Don’t correlate categorical variables. For ordered (ranked) categories, use a different statistic (Spearman’s Rank Correlation Coefficient).

Be sure the association is linear.

Two variables may be strongly associated even though the association is nonlinear.


Watch Out For These Common Mistakes


Don’t assume the relationship is linear just because the correlation coefficient is high.

Here the correlation is 0.979, but the relationship is actually curvilinear.

Treat outliers with respect. A single outlier with leverage can dominate the correlation coefficient.
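Both warnings about linearity can be reproduced with a perfectly curved relationship, y = x². The sketch below uses made-up x values (not the data behind the figure) and a helper, pearson_r, written for this example. One set of x values gives a high r even though the points lie exactly on a curve; the other gives r = 0 even though y is completely determined by x.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation coefficient computed from sums of squares."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)


# Case 1: x from 1 to 10. The curve only rises, so r is high.
x_rising = list(range(1, 11))
y_rising = [x ** 2 for x in x_rising]
r_rising = pearson_r(x_rising, y_rising)

# Case 2: x from -5 to 5. The curve falls and then rises, so r is zero.
x_sym = list(range(-5, 6))
y_sym = [x ** 2 for x in x_sym]
r_sym = pearson_r(x_sym, y_sym)

print(f"y = x^2 on 1..10:  r = {r_rising:.3f}  (high, yet not linear)")
print(f"y = x^2 on -5..5:  r = {r_sym:.3f}  (zero, yet perfectly associated)")
```

In both cases the scatterplot, not r, is what reveals the curve, so look at the picture first.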


TI-84
Now I suppose you would like to use the calculator to do all the work for you. To find the correlation
coefficient (r) on the TI you have to prepare the calculator to report r.

If you want the TI-84 to calculate the Pearson correlation coefficient r, you must turn
“Diagnostics” ON:

2nd 0 (CATALOG), scroll to DiagnosticOn, ENTER, ENTER


TI-84
Let us create some data to calculate a correlation coefficient.
Enter the data from the table into two lists. For the first list enter the value (year - 1900), so 1920 = 20.

STAT, 1:Edit
Select list L1, enter first value 20, ENTER
Enter 2nd value 30, ENTER, and repeat to end of list
Move to list L2, enter first value 54.1, ENTER, and repeat to end of list
2nd QUIT


TI-84
Let us look at the scatterplot

2nd Y= (STAT PLOT), ENTER, 1:Plot1, ON, ENTER, Type: select the scatterplot icon, XList: 2nd 1 (L1), YList: 2nd 2 (L2)
2nd MODE (QUIT), ZOOM, 9:ZoomStat

[Figure: scatterplot of Life Expectancy (vertical axis, 54 to 84) against Year - 1900 (horizontal axis, 20 to 100)]

TI-84
To have the TI-84 plot the points created by the two lists:

2nd Y= (STAT PLOT), ENTER, ON, Type: scatterplot, XList: 2nd 1 (L1), YList: 2nd 2 (L2), Mark
ZOOM, 9

To have the TI-84 calculate Pearson’s r:

STAT, ➢ CALC, 4:LinReg(ax+b), ENTER
or
STAT, ➢ CALC, 8:LinReg(a+bx), ENTER

Xlist: L1 (2nd 1)
Ylist: L2 (2nd 2)
FreqList:
Store RegEQ:
Calculate, ENTER

y=ax+b
a=.2718
b=51.65444444
r²=.9468452064
r=.9730597137

What is this telling us?
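As a rough cross-check of what LinReg(ax+b) reports, the sketch below computes the slope a, the intercept b, r, and r² from the sums of squares. The lists stand in for L1 and L2 and hold hypothetical values, not the table from the slide, so the output will not match the calculator screen above.

```python
from math import sqrt

# Hypothetical stand-ins for the calculator lists L1 and L2.
l1 = [0, 10, 20, 30, 40]
l2 = [50.0, 53.1, 55.8, 59.2, 61.9]

n = len(l1)
x_bar = sum(l1) / n
y_bar = sum(l2) / n

ss_xx = sum((x - x_bar) ** 2 for x in l1)
ss_yy = sum((y - y_bar) ** 2 for y in l2)
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(l1, l2))

a = ss_xy / ss_xx                 # slope of y = ax + b
b = y_bar - a * x_bar             # intercept
r = ss_xy / sqrt(ss_xx * ss_yy)   # Pearson's r
r_squared = r * r                 # the r² line is simply r squared

print("y=ax+b")
print(f"a={a}")
print(f"b={b}")
print(f"r2={r_squared}")
print(f"r={r}")
```

An r this close to +1 would indicate a strong, positive, linear association. As always, check the scatterplot before trusting that summary.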