AP Stats Chapter 7 Outline

Chapter 6
Scatterplots, Association,  
and Correlation
1 /47
Chapter 6
Homework
Pg 164 1, 3, 5, 11, 19, 23, 24, 27, 29, 36
2 /47
Your Turn
3 /47
Chapter 6
Objectives
Calculate Pearson’s Product Moment Correlation Coefficient

Use TI-84 to find Pearson’s r
4 /47
Objective: Students will estimate Pearson’s r for a scatterplot and use the TI-84 to find r.
Scatterplots
Scatterplots are often an effective display for data on two variables.
In a scatterplot, you can see patterns, relationships between variables, and even any unusual
values sitting apart from the overall pattern.
Scatterplots are used to begin to understand any relationship between variables by
providing a way to picture possible associations between two quantitative variables.
Scatterplots are especially useful when there is a large number of data points. They provide
information about the relationship between two variables:
5 /47
Scatterplots
As always we start our conversation with a picture. We describe a scatterplot’s characteristics:
Strength
Shape - linear, curved, etc.
Direction - positive or negative
Presence of outliers
When describing the scatterplot you must describe those 4 characteristics.
6 /47
Looking at Scatterplots
When looking at scatterplots, we will look for direction, form (shape), strength, and unusual
features.
When asked to describe a relationship between two variables, we will describe
direction, form (shape), strength, and unusual features.
Do you get the point?
7 /47
Direction
A pattern in the distribution that runs from the upper left to the
lower right (negative slope) is said to have a negative direction.
A trend in the scatterplot running from lower left to upper right
(positive slope) has positive direction.
What is important is that the behavior of one variable is, in some way, associated with the
behavior of the second variable.
8 /47
Shape (Form)
To determine the shape of a distribution of points (scatterplot), decide what shape
of curve best describes the scatter.
If a straight line best approximates the scatter, the shape is linear.
If a curved line best approximates the scatter, the shape is curvilinear.
There are other models that fit data, and the TI-84 can calculate those models,
but we will restrict ourselves to linear models.
9 /47
Strength
That line we discussed in a previous slide is critical to determining the strength of the
relationship between two variables.
The closer your scatter (the points) comes to actually being on that line describing the shape,
the stronger the relationship.
A relationship with most points coming very close to being on a line is a strong relationship.
A relationship with most points far from being on a line is a weak relationship.
No relationship
10/47
Redux
A scatterplot gives you an indication of:
Strength - how close do the points come to a line.
Shape - Is that line a straight line (linear relationship)?
Direction - Is the relationship positive (positive slope) or negative (negative slope)?
Outliers - Are any points off by themselves?
Any time you have a scatterplot you must describe those four attributes.
11/47
As age increases, does the distance at which a highway sign becomes visible decrease?
The figure shows a moderate, negative, linear association between
age and sign legibility distance.
Additionally, as age increases, the variability in legibility distance
increases.
12/47
What is the relationship between per capita GDP and Life Expectancy?
The figure shows a weak, positive, linear association between GDP
and Life Expectancy.
There is noticeably less variability in the upper values of GDP.
13/47
Form
How closely do the points approach forming a line? Holistically, how close are the points to being on
a line?
If there is a straight line (linear) relationship, it will appear as a cloud of points stretched out in a
generally consistent, straight pattern.
The narrower the ellipse, the greater the tendency to linearity, the stronger the
relationship, and, thus, the greater the value of r.
It is the tendency to linearity that we want to quantify.
14/47
Form
If the relationship isn’t straight, but curves gently, while still increasing or decreasing, a linear
model would not be appropriate.
We may be able to (and often will) find ways to make a relationship more nearly linear.
But that must wait until later in the course.

15/47
Form
Ain’t no fixing this.
With this strong relationship there is no way to linearize the data.
We can find a strong, quadratic regression but that is beyond the scope of this course.
16/47
Form
If the relationship curves sharply, the methods of this course cannot
provide a single model.
It is possible to fit three distinct linear models to these data. This is most assuredly
not a single linear relationship.
17/47
Strength
At one extreme of strength, the points appear to follow a single line
(whether straight, curved, or bending all over the place).
[Figure panels: Strong and linear; Strong, just not linear; Strong, really not linear]
18/47
Strength
At the other extreme, the points appear as a vague cloud with no discernible trend or pattern:
Overall No Relationship
But look more closely at the different colors/shapes.
Note: we will quantify the amount of scatter soon.

19/47
Unusual Features
Look for the unexpected or the odd man out.
Sometimes the unexpected value indicates something interesting in a scatterplot of your
data that was unanticipated, perhaps suggesting some follow-up.
An outlier standing away from the overall pattern of the scatterplot may suggest
something interesting, or lead you in a direction you had not thought to investigate.
Clusters or subgroups within the data should also raise questions.
20/47
Roles for Variables

It is important to determine which of the two quantitative variables goes on the horizontal (x)
axis and which on the vertical (y) axis.
This determination is made based on the roles played by the variables. Sometimes the roles are
not so obvious and the variables can play either role. Sometimes the roles make more sense in one
direction. Sometimes the roles are determined by what the researcher is attempting to explain.
The explanatory or predictor variable (also called the independent variable) goes on the
horizontal (x) axis.
The response variable (or dependent variable) goes on the vertical (y) axis.
21/47
Roles for Variables

The roles (explanatory or response) that you select for variables
may be arbitrary and could be more about how you perceive the
relationship rather than about the variables themselves.
Does it make more sense to think about hours predicting score
or score predicting hours? Could we reverse the roles?
Though we call a variable the predictor (or explanatory) variable, placing that variable on the x-axis
does not necessarily mean that it explains or predicts anything. The variable on the y-axis may
not respond to it in any way. In other words, there may be no association, or the association is the
result of another variable.
You will hear this often: correlation ≠ causation

22/47
Correlation
Any measure of the strength of linearity should be independent of the units of measurement
used for the variables. If you weigh yourself in pounds and measure your height in inches,
the relationship between height and weight should not change if you measure height in
meters and weight in kilograms.
The correlation coefficient is independent of the units of measure.
So, if units do not matter, let’s remove them by calculating z-scores which have no units.
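As a rough check of the units claim, here is a minimal Python sketch. The heights and weights are hypothetical values invented for illustration, and `pearson_r` is a helper name chosen here (it is not a TI-84 or textbook function); it uses the z-score form of r.

```python
from statistics import mean, stdev

def pearson_r(xs, ys):
    """Pearson's r as the sum of z-score products divided by n - 1."""
    mx, my, sx, sy = mean(xs), mean(ys), stdev(xs), stdev(ys)
    total = sum(((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)

# Hypothetical heights (inches) and weights (pounds) for five people.
height_in = [62, 65, 68, 70, 74]
weight_lb = [120, 140, 155, 160, 190]

# The same five people measured in meters and kilograms.
height_m = [h * 0.0254 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_m, weight_kg)
print(round(r_imperial, 6), round(r_metric, 6))  # the two values agree
```

Because each z-score divides a deviation by a standard deviation in the same units, the units cancel before r is ever computed.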
23/47
Correlation
BMI = kg/m²
135 Italian preschool children (66 female), aged 29–68 months at baseline, were studied to
determine the best measure of change in adiposity (fat) in growing children.
They were recruited in a kindergarten in Legnago (Verona, Italy) after excluding those with
disorders affecting growth.
Height and weight were measured three times for each child by the same trained observer:
at baseline, 4 months, and 9 months.
[Figure: scatterplot of BMI 3 (vertical axis) against BMI 1 (horizontal axis)]
24/47
Correlation
[Figure: BMI z-score 3 (vertical axis) plotted against BMI z-score 1 (horizontal axis) in 135
subjects] Note the pointed shape of the scatterplot.
The points in quadrant 1 (both z-scores positive) and quadrant 3 (both z-scores
negative) tend to strengthen the positive relationship between z-score 1 and z-score 3.
The points in quadrant 2 (z-score 1 negative, z-score 3 positive) and quadrant 4 (z-score 1 positive
and z-score 3 negative) tend to weaken the positive relationship between z-score 1 and z-score 3.
Points on either axis (z-score = 0) have no effect on the relationship.
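The quadrant argument can be sketched in Python by looking at the sign of each product of z-scores. The data below are hypothetical and chosen only so that some points fall in each kind of quadrant.

```python
from statistics import mean, stdev

# Hypothetical paired measurements, invented for illustration.
x = [1, 2, 3, 4, 5, 6]
y = [2, 1, 4, 3, 6, 5]

mx, my, sx, sy = mean(x), mean(y), stdev(x), stdev(y)
zx = [(v - mx) / sx for v in x]
zy = [(v - my) / sy for v in y]

# A positive product (quadrants 1 and 3) pushes r up;
# a negative product (quadrants 2 and 4) pulls r down.
products = [a * b for a, b in zip(zx, zy)]
for a, b, p in zip(zx, zy, products):
    effect = "strengthens" if p > 0 else "weakens" if p < 0 else "no effect"
    print(f"zx={a:+.2f}  zy={b:+.2f}  product={p:+.2f}  {effect}")

r = sum(products) / (len(x) - 1)
print("r =", round(r, 3))
```

Here four points contribute positive products and two contribute negative ones, so the sum, and therefore r, comes out positive.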
25/47
Correlation
The correlation coefficient (r) gives us a numerical measurement of the strength of the linear
relationship between the explanatory and response variables.
$$ r = \frac{\sum_{i=1}^{n} z_{x_i} z_{y_i}}{n - 1} $$
“How did we get this?” you might ask. Well even if you mightn’t, I am happy to explain.
26/47
Allow me to reiterate
The correlation coefficient is a numeric measure of the linearity of your data points.
That means the correlation coefficient tells you how close the points come to forming
(or being on) a line.
In any discussion of correlation you must list the attributes mentioned earlier:
Shape, Strength, Direction.
The correlation coefficient (Pearson’s r) is a number indicating the strength of the linear
relationship of the variables.
27/47
Pearson’s r
Now, for that explanation you were so eagerly awaiting:
Pearson’s correlation coefficient between two variables is defined as the covariance
of the two variables divided by the product of their standard deviations.
For a population the formula for finding the correlation coefficient is:
$$ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} $$
where E is the expected value operator (mean), and cov means covariance.
28/47
Pearson’s r
$$ \rho = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y} $$
The previous formula defines the population correlation coefficient, represented by the Greek letter
ρ (rho). Substituting estimates of the covariances and variances based on a sample gives the
sample correlation coefficient, r:
$$
\begin{aligned}
r &= \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}
          {\sqrt{\dfrac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}\;\sqrt{\dfrac{\sum_{i=1}^{n}(Y_i - \bar{Y})^2}{n-1}}} \\
  &= \frac{1}{n-1}\cdot\frac{\sum_{i=1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})}{s_X\, s_Y} \\
  &= \frac{1}{n-1}\sum_{i=1}^{n}\frac{X_i - \bar{X}}{s_X}\cdot\frac{Y_i - \bar{Y}}{s_Y} \\
  &= \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}
\end{aligned}
$$
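The derivation ends with r written as a sum of z-score products. A quick numerical check, sketched in Python with hypothetical data, confirms that the covariance form and the z-score form give the same number.

```python
from math import sqrt

# Hypothetical sample, invented for illustration.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 4.0, 8.0, 9.0]
n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n

# Sample covariance and sample standard deviations (all divide by n - 1).
cov_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (n - 1)
s_x = sqrt(sum((a - x_bar) ** 2 for a in x) / (n - 1))
s_y = sqrt(sum((b - y_bar) ** 2 for b in y) / (n - 1))

# Form 1: covariance divided by the product of standard deviations.
r_cov = cov_xy / (s_x * s_y)

# Form 2: sum of z-score products divided by n - 1.
r_z = sum(((a - x_bar) / s_x) * ((b - y_bar) / s_y) for a, b in zip(x, y)) / (n - 1)

print(round(r_cov, 6), round(r_z, 6))
```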
29/47
Pearson’s r
$$ r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i} $$
To calculate this value we would need to create several lists of data, similar to the lists used in finding
the standard deviation.
x_i     y_i     x_i − x̄    y_i − ȳ    (x_i − x̄)²    (y_i − ȳ)²    (x_i − x̄)(y_i − ȳ)
 .       .         .          .           .             .                 .
 .       .         .          .           .             .                 .
 .       .         .          .           .             .                 .
Σx_i    Σy_i                           SSxx          SSyy              SSxy
(all sums run from i = 1 to n)
I really do not want to do all that work and I am certain you are even less enthusiastic.
So we will use the calculator.
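For anyone curious what those lists look like, here is a short Python sketch that builds the same columns. The data are hypothetical, and the last step uses r = SSxy / √(SSxx · SSyy), which is algebraically the same as the z-score formula because the (n − 1) factors cancel.

```python
from math import sqrt

# Hypothetical data; the slide's own lists are not given in the text.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# One row per data point: x, y, both deviations, their squares, their product.
rows = []
for x, y in zip(xs, ys):
    dx, dy = x - x_bar, y - y_bar
    rows.append((x, y, dx, dy, dx ** 2, dy ** 2, dx * dy))

# Column totals from the bottom row of the table.
ss_xx = sum(row[4] for row in rows)
ss_yy = sum(row[5] for row in rows)
ss_xy = sum(row[6] for row in rows)

r = ss_xy / sqrt(ss_xx * ss_yy)

for row in rows:
    print("  ".join(f"{value:7.2f}" for value in row))
print(f"SSxx={ss_xx:.2f}  SSyy={ss_yy:.2f}  SSxy={ss_xy:.2f}  r={r:.4f}")
```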
30/47
Significance
We can always find a correlation coefficient. Given two equal-sized sets of numbers we can create
ordered pairs that will give us numbers to put into the formula to find a correlation coefficient.
For example: length of nose and GPA, or length of hair and IQ.

The question then becomes: is that coefficient significant?
There are two kinds of “significant”.

1. Statistically significant
2. Socially significant
Statistically significant suggests the results are due to some effect and not the result of
chance.
Socially significant suggests the results have meaning, are important, and/or are large
enough to matter.
31/47
Significance
Statistical significance of r requires that some assumptions be met.
1. The variables are quantitative random variables.
2. The variables are each unimodal and symmetric.
3. The variables do have a linear relationship.
4. The variables are bivariate normal (i.e. at each value of the normally distributed
independent variable, the dependent variable is normally distributed).
32/47
Correlation
For the students’ heights and weights, the correlation is 0.644.
What does this mean in terms of strength? We’ll address this shortly.
To test the significance of r, we must be familiar with hypothesis testing. For now, we will
simply state that large values of r suggest there is some association in the behavior of the
variables together.
Large values of r suggest that the behavior of one variable can be suggested from the behavior
of another variable. Notice that I did not claim the behavior of variable 1 is caused by
the behavior of variable 2. I simply notice that changes in one variable tend to be matched
by changes in the other.
33/47
Correlation Conditions
The correlation coefficient is denoted r for a sample and ρ (rho) for the population.
Correlation (most often Pearson’s Product Moment Correlation Coefficient or Pearson’s r)
quantifies the strength of the linear association between two quantitative variables.
Before you use correlation, you must check a few conditions:

Quantitative Variables Condition
Sufficiently Linear Condition
Outlier Condition
34/47
Quantitative Variables Condition:
Correlation applies only to quantitative variables.
Don’t apply correlation to categorical data masquerading as quantitative.
Check that you know the variables’ units and what they measure.
For categorical variables whose categories are ordered (ranked), another correlation coefficient is
often used, Spearman’s Rank Correlation Coefficient. That is beyond the scope of this class.
ALCOHOL * SICKDAYS Crosstabulation (Count)

                                    SICKDAYS
ALCOHOL                   0 days   1-6 days   7+ days   Total
Without Risk                 347        113       145     605
Hardly any Risk              154         63        56     273
Some-Considerable Risk        52         25        34     111
Total                        553        201       235     989
35/47
Sufficiently Linear Condition (Linearity):
You can calculate a correlation coefficient for any pair of variables.
But correlation measures the strength only of the linear association, and will be a misleading
summary if the relationship is not truly linear.
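A small Python sketch shows why this condition matters. The data are hypothetical: y is determined exactly by x (y = x²), yet the association is not linear, so r comes out essentially zero. `pearson_r` is a helper name chosen for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from sums of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)

# Hypothetical data: a perfect, but symmetric and curved, relationship.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

r = pearson_r(xs, ys)
print(r)  # essentially zero despite the perfect pattern
```

A scatterplot would reveal the parabola immediately, which is why the picture comes before the number.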
36/47
Outlier Condition:
Outliers can significantly affect the correlation coefficient.
It is possible, though unlikely, for an outlier to change a
positive association into a negative correlation coefficient
(and vice versa).
Only considering the blue observations:
When we include the red observations.
As usual when you see an outlier, it’s a good idea to report the correlations
with and without the outliers.
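Reporting r with and without an outlier can be sketched as follows. The points are hypothetical (they are not the blue and red observations from the figure), and `pearson_r` is a helper name chosen for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from sums of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)

# Hypothetical points with a clear positive association.
x = [1, 2, 3, 4, 5]
y = [2, 3, 3, 5, 6]

# The same points plus one hypothetical outlier far from the pattern.
x_with = x + [15]
y_with = y + [-10]

r_without = pearson_r(x, y)
r_with = pearson_r(x_with, y_with)
print(f"without outlier: r = {r_without:.3f}")
print(f"with outlier:    r = {r_with:.3f}")
```

In this made-up example a single point is enough to flip the sign of r, which is the extreme case described above.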
37/47
Correlation Properties
The sign of a correlation coefficient gives the direction of the association.
Correlation is always between –1 and +1.
Correlation can equal –1 or +1, but that means all the data points fall exactly on a
single straight line, and that is unlikely in the extreme.
A correlation near zero indicates a weak, or no, linear association.
38/47
Correlation treats x and y symmetrically:
The correlation of x with y is the same as the correlation of y with x.
The correlation coefficient, r, has no units.
Correlation is not affected by changes in
the center or scale of either variable.
Correlation can be calculated by using only z-scores, and z-scores are unaffected by
changes in center or scale.
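These properties can be checked with a short Python sketch. The temperatures and sales figures are hypothetical, and `pearson_r` is a helper name chosen for this example. Converting Celsius to Fahrenheit changes both the center (+32) and the scale (×1.8).

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from sums of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)

# Hypothetical daily high temperatures (Celsius) and ice cream sales.
celsius = [10, 15, 20, 25, 30]
sales = [12, 18, 25, 24, 35]
fahrenheit = [1.8 * c + 32 for c in celsius]

r_cs = pearson_r(celsius, sales)     # x with y
r_sc = pearson_r(sales, celsius)     # y with x: same value
r_fs = pearson_r(fahrenheit, sales)  # shifted and rescaled x: same value
print(round(r_cs, 6), round(r_sc, 6), round(r_fs, 6))
```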
39/47
Correlation measures the strength of the linear association between the two variables. Notice
how many times I have mentioned that fact.
Variables may have a strong association but a small Pearson’s r because the
association is not linear.
Correlation is sensitive to outliers. Outliers have inordinately large effects on Pearson’s r.
Outliers tend to have large leverage and can make a weak association look
stronger with a large r.
40/47
Correlation ≠ Causation
Whenever we have a strong correlation, it is tempting to explain it by claiming that the predictor
variable has caused the changes in the response variable.
Correlation is NOT causation; be very careful how you express the relationship. Do NOT use causal
language.
Scatterplots and correlation coefficients never, never, never prove causation.
A hidden variable that stands behind a relationship and creates the illusion of a relationship by
simultaneously affecting the other two variables is called a lurking (or confounding) variable.
e.g. The number of churches in a community is strongly correlated with the amount of
criminal activity in that community. Population size is the lurking variable: larger communities tend to have more of both.
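A simulation can make the lurking-variable idea concrete. In this Python sketch every number is invented: population drives both the church count and the crime count, and neither of those two depends on the other anywhere in the code. `pearson_r` is a helper name chosen for this example.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from sums of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)

random.seed(1)  # fixed seed so the run is repeatable

# Hypothetical communities: population is the lurking variable.
population = [random.uniform(1_000, 100_000) for _ in range(200)]
churches = [p / 2_000 + random.gauss(0, 3) for p in population]
crimes = [p / 100 + random.gauss(0, 50) for p in population]

r_raw = pearson_r(churches, crimes)

# Subtract the part of each variable that this simulation built from population.
church_noise = [c - p / 2_000 for c, p in zip(churches, population)]
crime_noise = [k - p / 100 for k, p in zip(crimes, population)]
r_adjusted = pearson_r(church_noise, crime_noise)

print(f"churches vs crimes:              r = {r_raw:.3f}")
print(f"after removing population's part: r = {r_adjusted:.3f}")
```

The raw correlation is strong even though the code contains no link between churches and crime; once the shared dependence on population is removed, what remains is unrelated noise.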
41/47
Correlation Matrices (Tables)

It is common in some fields to compute the correlations between each pair of variables in a
collection of variables and arrange these correlations in a table.
I do not like these correlation matrices because they imply an equal significance for each
of the relationships that is most likely not justified.
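For reference, a correlation table of the kind described can be sketched in a few lines of Python. The three variables and their values are hypothetical, and `pearson_r` is a helper name chosen for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson's r computed from sums of squared deviations."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_xx = sum((x - x_bar) ** 2 for x in xs)
    ss_yy = sum((y - y_bar) ** 2 for y in ys)
    return ss_xy / sqrt(ss_xx * ss_yy)

# Hypothetical measurements on six students.
data = {
    "hours": [1, 2, 3, 4, 5, 6],
    "score": [55, 60, 70, 72, 80, 91],
    "sleep": [9, 8, 8, 7, 6, 6],
}
names = list(data)

# r for every pair of variables, including each variable with itself.
matrix = {a: {b: pearson_r(data[a], data[b]) for b in names} for a in names}

print(" " * 8 + "".join(f"{name:>8}" for name in names))
for a in names:
    print(f"{a:<8}" + "".join(f"{matrix[a][b]:8.3f}" for b in names))
```

The diagonal is always 1 and the table is symmetric, so only the entries on one side of the diagonal carry information. The table says nothing about whether each pair meets the conditions for correlation.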
42/47
Watch Out For These Common Mistakes

Don’t say “correlation” when you mean “association.”
More often than not, people say correlation when they mean association.
The word “correlation” should be reserved for measuring the strength and direction of the linear
relationship between two quantitative variables.
We often run into that kind of error when people say two variables are “correlated” when they
actually mean “dependent”.
Don’t correlate categorical variables. If the categories are ordered, a rank-based statistic
(Spearman’s Rank Correlation Coefficient) may apply.
Be sure the association is linear.

Two variables may have a strong association that is nonlinear.
43/47
Watch Out For These Common Mistakes

Don’t assume the relationship is linear just because the correlation coefficient is high.
Here the correlation is 0.979, but the relationship
is actually curvilinear.
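The same trap can be reproduced with made-up numbers. In this Python sketch the data are hypothetical (not the data behind the 0.979 figure): y = x² for x from 1 to 10 is clearly curved, yet r is very high. The residuals from the least-squares line give the curve away.

```python
from math import sqrt

# Hypothetical curved data: y grows as the square of x.
xs = list(range(1, 11))
ys = [x ** 2 for x in xs]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
ss_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
ss_xx = sum((x - x_bar) ** 2 for x in xs)
ss_yy = sum((y - y_bar) ** 2 for y in ys)

r = ss_xy / sqrt(ss_xx * ss_yy)

# Least-squares line and what is left over after fitting it.
slope = ss_xy / ss_xx
intercept = y_bar - slope * x_bar
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]

print(f"r = {r:.4f}")
print("residuals:", [round(e, 1) for e in residuals])
```

The residuals are positive at both ends and negative in the middle, a systematic bend that a high r alone would never reveal.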
Treat outliers with respect. A single outlier with
leverage can dominate the correlation coefficient.
44/47
TI-84
Now I suppose you would like to use the calculator to do all the work for you. To find the correlation
coefficient (r) on the TI you have to prepare the calculator to report r.
If you want the TI-84 to calculate the Pearson correlation coefficient r, you must turn
“Diagnostics” ON:
2nd 0 (CATALOG) → DiagnosticOn → ENTER → ENTER
45/47
TI-84
Let us create some data to calculate a correlation coefficient.
Enter the data from the table into two lists. For the first list,
enter the value (year - 1900), so 1920 = 20.
STAT → 1:Edit → select list L1 → enter the first value, 20 → ENTER
→ enter the second value, 30 → ENTER → repeat to the end of the list
Move to list L2 → enter the first value, 54.1 → ENTER → repeat to the end of the list → 2nd QUIT
46/47
TI-84
Let us look at the scatterplot
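2nd Y= (STAT PLOT) → ENTER to select 1:Plot1 → highlight On, ENTER → Type: scatterplot → XList: 2nd 1 (L1) → YList: 2nd 2 (L2)
2nd MODE (QUIT) → ZOOM → 9:ZoomStat
[Figure: scatterplot of Life Expectancy (vertical axis, about 54 to 84) against Year - 1900 (horizontal axis, 20 to 100)]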
47/47
TI-84
To have the TI-84 plot the points created by the two lists:
2nd Y= (STAT PLOT) → ENTER → On → Type: scatterplot → XList: 2nd 1 → YList: 2nd 2 → Mark
ZOOM → 9
To have the TI-84 calculate Pearson’s r:
STAT → CALC → 4:LinReg(ax+b) → ENTER
(or STAT → CALC → 8:LinReg(a+bx) → ENTER)
Xlist: L1 (2nd 1)
Ylist: L2 (2nd 2)
FreqList:
Store RegEQ:
Calculate → ENTER
y=ax+b
a=.2718
b=51.65444444
r²=.9468452064
r=.9730597137
What is this telling us?
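For comparison, here is a minimal Python sketch of the quantities LinReg(ax+b) reports. The data are hypothetical: only the first life expectancy value (54.1) appears on the slide, so the remaining values are invented and the output will not match the calculator screen above.

```python
from math import sqrt

# x is (year - 1900); y stands in for life expectancy.
# Only 54.1 comes from the slide; the other y values are invented.
x = [20, 30, 40, 50, 60, 70, 80, 90, 100]
y = [54.1, 59.0, 63.5, 67.0, 70.0, 71.5, 74.0, 75.5, 77.0]

n = len(x)
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_xx = sum((v - x_bar) ** 2 for v in x)
ss_yy = sum((v - y_bar) ** 2 for v in y)
ss_xy = sum((u - x_bar) * (v - y_bar) for u, v in zip(x, y))

a = ss_xy / ss_xx            # slope: the calculator's "a" in y=ax+b
b = y_bar - a * x_bar        # intercept: the calculator's "b"
r = ss_xy / sqrt(ss_xx * ss_yy)
r_squared = r ** 2

print("y=ax+b")
print(f"a={a:.4f}")
print(f"b={b:.4f}")
print(f"r2={r_squared:.4f}")
print(f"r={r:.4f}")
```

Note that r² on the calculator screen is simply the square of r, and the fitted line always passes through the point (x̄, ȳ).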
48/47
