Ap Stats Chapter 7 Outline
Scatterplots, Association,
and Correlation
Chapter 6
Homework
Pg 164 1, 3, 5, 11, 19, 23, 24, 27, 29, 36
Your Turn
Chapter 6
Objectives
Objective: Students will estimate Pearson’s r for a scatterplot and use the TI-84 to find r.
Scatterplots
Scatterplots are often an effective display for data consisting of two variables.
In a scatterplot, you can see patterns, relationships between variables, and even any unusual
values sitting apart from the overall pattern.
Scatterplots are used to begin to understand any relationship between variables by
providing a way to picture possible associations between two quantitative variables.
Scatterplots are especially useful when there is a large number of data points. They provide
information about the relationship between two variables:
Scatterplots
As always we start our conversation with a picture. We describe a scatterplot’s characteristics:
Strength
Shape - linear, curved, etc.
Direction - positive or negative
Presence of outliers
Looking at Scatterplots
When looking at scatterplots, we will look for direction, form (shape), strength, and unusual
features.
When asked to describe a relationship between two variables, we will describe its
direction, form (shape), strength, and unusual features.
Direction
A pattern in the distribution that runs from the upper left to the
lower right (negative slope) is said to have a negative direction.
What is important is that the behavior of one variable is, in some way, associated with the
behavior of the second variable.
Shape (Form)
Determining the shape of a distribution of points (scatterplot) is to determine what shape
of curve best describes the scatter.
There are other models that fit data, and the TI-84 can calculate those models,
but we will restrict ourselves to linear models.
Strength
That line we discussed in a previous slide is critical to determining the strength of the
relationship between two variables.
The closer your scatter (the points) comes to actually being on that line describing the shape,
the stronger the relationship.
No relationship
Redux
A scatterplot gives you an indication of:
Strength - How close do the points come to a line?
Shape - Is that line a straight line (linear relationship)?
Direction - Is the relationship positive (positive slope) or negative (negative slope)?
Outliers - Are any points off by themselves?
Any time you have a scatterplot you must describe those four attributes.
Looking at Scatterplots
As age increases, does the distance at which a highway sign becomes visible decrease?
Looking at Scatterplots
What is the relationship between per capita GDP and Life Expectancy?
Form
How closely do the points approach forming a line? Holistically, how close are the points to being on
a line?
If there is a straight line (linear) relationship, it will appear as a cloud of points stretched out in a
generally consistent, straight pattern.
Form
If the relationship isn’t straight, but curves gently, while still increasing or decreasing, a linear
model would not be appropriate.
We may be able to (and often will) find ways to make a relationship more nearly linear.
Form
Ain’t no fixing this.
We can find a strong, quadratic regression but that is beyond the scope of this course.
Form
If the relationship curves sharply,…
It is possible to fit three distinct linear models to this data. This is most assuredly
not a single linear relationship.
Strength
At one extreme of strength, the points appear to follow a single line.
Strength
At the other extreme, the points appear as a vague cloud with no discernible trend or pattern:
Overall No Relationship
Unusual Features
Look for the unexpected or the odd man out.
An outlier standing away from the overall pattern of the scatterplot may suggest
something interesting, or lead you in a direction you had not thought to investigate.
This determination is made based on the roles played by the variables. Sometimes the roles are
not so obvious and the variables can play either role. Sometimes the roles make more sense in one
direction. Sometimes the roles are determined by what the researcher is attempting to explain.
The explanatory or predictor variable (also called the independent variable) goes on the
horizontal (x) axis.
The response variable (or dependent variable) goes on the vertical (y) axis.
Though we call a variable the predictor (or explanatory) variable, placing that variable on the x-axis
does not necessarily mean that it explains or predicts anything. The variable on the y-axis may
not respond to it in any way. In other words, there may be no association, or the association is the
result of another variable.
Correlation
Any measure of strength of linearity should be
independent of the units of measurement used for the
variables. If you weigh yourself in pounds and
measure your height in inches, the relationship
between height and weight should not change if you
measure height in meters and weight in kilograms.
So, if units do not matter, let’s remove them by calculating z-scores which have no units.
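The claim that units should not matter can be checked numerically. The sketch below uses made-up heights and weights (they are not from the slides) and a small helper, `pearson_r`, written here only for illustration. It computes r once in inches and pounds and once in meters and kilograms.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical measurements for five people (illustration only).
height_in = [62, 65, 68, 70, 74]
weight_lb = [120, 140, 155, 160, 190]

# The same people, converted to meters and kilograms.
height_m = [h * 0.0254 for h in height_in]
weight_kg = [w * 0.45359237 for w in weight_lb]

r_imperial = pearson_r(height_in, weight_lb)
r_metric = pearson_r(height_m, weight_kg)

# The two values agree up to floating-point rounding.
print(r_imperial, r_metric)
```

Because each value is converted to a z-score before it is used, the conversion factors cancel and the two results match.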
BMI 3
(fat) change in growing children.
Height and weight were measured three times for each child by the same trained observer,
at baseline, 4 and 9 months.
Correlation
[Figure: scatterplot of BMI z-score 3 (vertical axis) against BMI z-score 1 (horizontal axis)]
BMI z-score 3 plotted against BMI z-score 1 in 135 subjects. Note the pointed shape of the scatterplot.
The points in quadrant two (z-score 1 negative, z-score 3 positive) and quadrant four (z-score 1
positive and z-score 3 negative) tend to weaken the positive relationship between z-score 1 and
z-score 3. Points on either axis (z-score = 0) have no effect on the relationship.
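The quadrant argument comes down to the sign of the product of the two z-scores for each point. Here is a minimal sketch of that idea. The z-score pairs are invented for illustration and `effect_on_r` is a hypothetical helper, not something from the slides.

```python
def effect_on_r(zx, zy):
    """Describe how one point's z-score product contributes to r."""
    product = zx * zy
    if product > 0:
        return "raises r"
    if product < 0:
        return "lowers r"
    return "no effect"


# Hypothetical (zx, zy) pairs: one per quadrant plus one on each axis.
points = {
    "quadrant 1": (1.2, 0.9),
    "quadrant 2": (-0.5, 0.7),
    "quadrant 3": (-0.8, -1.1),
    "quadrant 4": (0.6, -0.4),
    "on the y-axis": (0.0, 1.5),
    "on the x-axis": (1.0, 0.0),
}

for label, (zx, zy) in points.items():
    print(f"{label:14s} product = {zx * zy:+.2f} -> {effect_on_r(zx, zy)}")
```

Points in quadrants one and three have products with a positive sign, points in quadrants two and four have products with a negative sign, and a point on either axis contributes a product of zero.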
Correlation
The correlation coefficient (r) gives us a numerical measurement of the strength of the linear
relationship between the explanatory and response variables.
$$r = \frac{\sum_{i=1}^{n} z_{x_i} z_{y_i}}{n - 1}$$
“How did we get this?” you might ask. Well even if you mightn’t, I am happy to explain.
Allow me to reiterate
The correlation coefficient is a numeric measure of the linearity of your data points.
That means the correlation coefficient tells you how close the points come to forming
(or being on) a line.
In any discussion of correlation you must list the attributes mentioned earlier:
Shape, Strength, Direction.
Pearson’s r
Now, for that explanation you were so eagerly awaiting:
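For a population, the formula for finding the correlation coefficient is:
$$\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$
where E is the expected value operator (mean), cov means covariance, μ_X and μ_Y are the population
means, and σ_X and σ_Y are the population standard deviations.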
The previous formula defines the population correlation coefficient, represented by the Greek letter
ρ (rho). Substituting estimates of the covariances and variances based on a sample gives the
sample correlation coefficient, r:
$$
\begin{aligned}
r &= \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(X_i-\bar{X})^2}\;\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \\
  &= \frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{\sqrt{\sum_{i=1}^{n}(X_i-\bar{X})^2}\;\sqrt{\sum_{i=1}^{n}(Y_i-\bar{Y})^2}} \\
  &= \frac{1}{n-1}\cdot\frac{\sum_{i=1}^{n}(X_i-\bar{X})(Y_i-\bar{Y})}{s_X\, s_Y} \\
  &= \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{X_i-\bar{X}}{s_X}\right)\left(\frac{Y_i-\bar{Y}}{s_Y}\right) \\
  &= \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}
\end{aligned}
$$
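As a sanity check on the algebra, the covariance form and the z-score form can be evaluated on the same data and compared. The numbers below are invented for illustration, and the variable names are chosen here rather than taken from the slides.

```python
from math import sqrt
from statistics import mean

# Hypothetical paired data (illustration only).
x = [1.0, 2.0, 4.0, 5.0, 8.0]
y = [2.0, 3.0, 7.0, 8.0, 10.0]
n = len(x)
x_bar, y_bar = mean(x), mean(y)

# Sample standard deviations, using the n - 1 divisor.
s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))

# First form: sample covariance divided by the product of the standard deviations.
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
r_cov = cov_xy / (s_x * s_y)

# Last form: sum of z-score products divided by n - 1.
r_z = sum(((xi - x_bar) / s_x) * ((yi - y_bar) / s_y) for xi, yi in zip(x, y)) / (n - 1)

print(r_cov, r_z)
```

The two printed values should agree up to floating-point rounding, which is what the chain of equalities says.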
Pearson’s r
$$r = \frac{1}{n-1}\sum_{i=1}^{n} z_{x_i} z_{y_i}$$
To calculate this value we would need to create several lists of data similar to the lists used in finding
standard deviation.
x_i    y_i    X_i − X̄    Y_i − Ȳ    (X_i − X̄)²    (Y_i − Ȳ)²    (X_i − X̄)(Y_i − Ȳ)
 .      .        .          .            .              .                 .
 .      .        .          .            .              .                 .
 .      .        .          .            .              .                 .
I really do not want to do all that work and I am certain you are even less enthusiastic.
So we will use the calculator.
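For anyone curious what "all that work" looks like, here is a short sketch that builds the same columns for a tiny made-up data set (five points chosen so the arithmetic stays in whole numbers). It is an illustration only and not the calculator's internal method.

```python
from math import sqrt
from statistics import mean

# Hypothetical data, chosen so both means are whole numbers.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
x_bar, y_bar = mean(x), mean(y)

# One row per point: x, y, both deviations, both squared deviations, and the product.
rows = []
for xi, yi in zip(x, y):
    dx, dy = xi - x_bar, yi - y_bar
    rows.append((xi, yi, dx, dy, dx ** 2, dy ** 2, dx * dy))

print("x\ty\tx-xbar\ty-ybar\t(x-xbar)^2\t(y-ybar)^2\tproduct")
for row in rows:
    print(*row, sep="\t")

sum_dx2 = sum(row[4] for row in rows)
sum_dy2 = sum(row[5] for row in rows)
sum_dxdy = sum(row[6] for row in rows)

# The (n - 1) factors cancel, leaving the product sum over the root of the two squared sums.
r = sum_dxdy / sqrt(sum_dx2 * sum_dy2)
print(round(r, 4))  # 0.7746
```

The column sums here are 10, 6, and 6, so r = 6 / √60, which is about 0.77.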
Significance
We can always find a correlation coefficient. Given two equal-sized sets of numbers, we can create
ordered pairs that give us numbers to put into the formula to find a correlation coefficient.
Statistically significant suggests the results are due to some effect and not the result of
chance.
Socially significant suggests the results have meaning, are important, and/or are large
enough to matter.
Significance
Statistical significance of r requires that some assumptions be met.
1. The variables are quantitative random variables.
2. The variables are each unimodal and symmetric.
3. The variables do have a linear relationship.
4. The variables are bivariate normal (i.e. at each value of the normally distributed
independent variable, the dependent variable is normally distributed).
Correlation
For the students’ heights and weights, the correlation is 0.644.
What does this mean in terms of strength? We’ll address this shortly.
To test the significance of r, we must be familiar with hypothesis testing. For now, we will
simply state that large values of r suggest there is some association in the behavior of the
variables together.
Large values of r suggest that the behavior of one variable can be suggested from the behavior
of another variable. Notice that I did not claim the behavior of variable 1 is caused by
the behavior of variable 2. I simply notice that changes in one variable tend to be matched
by changes in the other.
Correlation Conditions
The correlation coefficient is denoted r for a sample and ρ (rho) for the population.
Correlation Conditions
Quantitative Variables Condition:
Correlation applies only to quantitative variables.
Don’t apply correlation to categorical data masquerading as quantitative.
Check that you know the variables’ units and what they measure.
For categorical variables whose categories can be ranked (ordinal data), there is another correlation
coefficient often used, Spearman’s Rank Correlation Coefficient. That is beyond the scope of this class.
ALCOHOL * SICKDAYS Crosstabulation (Count)
                                     SICKDAYS
ALCOHOL                    0 days   1-6 days   7+ days   Total
Without Risk                  347        113       145     605
Hardly any Risk               154         63        56     273
Some-Considerable Risk         52         25        34     111
Total                         553        201       235     989
Correlation Conditions
Sufficiently Linear Condition (Linearity):
You can calculate a correlation coefficient for any pair of variables.
But correlation measures the strength of the linear association only, and it will be a misleading
summary if the relationship is not truly linear.
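A quick way to see why linearity matters: take points that lie exactly on a parabola. The relationship is perfect, yet it is not linear, and r comes out at essentially zero. The data and the `pearson_r` helper below are made up for illustration.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Every point lies exactly on the curve y = x^2.
x = [-3, -2, -1, 0, 1, 2, 3]
y = [xi ** 2 for xi in x]

r = pearson_r(x, y)
print(r)  # zero, up to floating-point rounding
```

So a value of r near zero says only that there is no linear association. It does not say that there is no association at all, which is why we look at the scatterplot first.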
Correlation Conditions
Outlier Condition:
Outliers can significantly affect the correlation coefficient.
As usual when you see an outlier, it’s a good idea to report the correlations
with and without the outliers.
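The sketch below shows what reporting the correlation "with and without" might look like. It uses invented data: six points that sit close to a line, plus one outlier far below the pattern. The helper `pearson_r` is written here for illustration.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical data: six points close to the line y = 2x.
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0]

# The same data with one outlier added at (7, 1.0).
x_out = x + [7]
y_out = y + [1.0]

r_without = pearson_r(x, y)
r_with = pearson_r(x_out, y_out)

print("without outlier:", round(r_without, 3))
print("with outlier:   ", round(r_with, 3))
```

One point out of seven is enough to pull a correlation that is nearly 1 down to a weak one, so it is worth reporting both values.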
Correlation Properties
The sign of a correlation coefficient gives the direction of the association.
Correlation can equal –1 or +1, but only when all the data points fall exactly on a
single straight line, and that is unlikely in the extreme.
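To illustrate both properties, here is a small sketch with made-up points placed exactly on two lines, one rising and one falling. The `pearson_r` helper is an illustrative implementation of the z-score formula.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


x = [1, 2, 3, 4, 5]
y_up = [2 * xi + 1 for xi in x]      # exactly on a line with positive slope
y_down = [-3 * xi + 10 for xi in x]  # exactly on a line with negative slope

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)

# Expect +1 and -1, up to floating-point rounding.
print(r_up, r_down)
```

The steepness of the line (2 versus –3) does not show up in r. Only the sign of the slope and the tightness of the scatter do.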
Correlation Properties
Correlation treats x and y symmetrically:
The correlation of x with y is the same as the correlation of y with x.
Correlation can be calculated by using only z-scores, and z-scores are unaffected by
changes in center or scale.
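Both properties can be checked with a few lines of code. In this sketch the data are invented (temperatures and sales figures) and `pearson_r` is an illustrative helper. Converting Celsius to Fahrenheit changes both the center and the scale of x.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical data: daily temperature and number of items sold.
temps_c = [10, 15, 20, 25, 30]
sales = [12, 18, 21, 30, 33]

# Symmetry: swapping the roles of x and y leaves r unchanged.
r_xy = pearson_r(temps_c, sales)
r_yx = pearson_r(sales, temps_c)

# Change of center and scale: Fahrenheit = 1.8 * Celsius + 32.
temps_f = [1.8 * c + 32 for c in temps_c]
r_fahrenheit = pearson_r(temps_f, sales)

print(r_xy, r_yx, r_fahrenheit)
```

All three printed values should agree up to floating-point rounding.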
Correlation Properties
Correlation measures the strength of the linear association between the two variables. Notice
how many times I have mentioned that fact.
Correlation ≠ Causation
Whenever we have a strong correlation, it is tempting to explain it by claiming that the predictor
variable has caused the changes in the response variable.
Correlation is NOT causation. Be very careful how you express the relationship. Do NOT use causal
language.
A hidden variable that stands behind a relationship and creates the illusion of a relationship by
simultaneously affecting the other two variables is called a lurking (or confounding) variable.
e.g. The number of churches in a community is strongly correlated with the amount of
criminal activity in that community. A likely lurking variable is the community's population:
larger communities tend to have more of both.
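A toy version of the churches example is sketched below. All numbers are invented: six imaginary communities in which both counts grow roughly in proportion to population. Dividing by population is used here as one rough way to adjust for the lurking variable. It is an assumption of this sketch, not a method from the slides.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, my = mean(xs), mean(ys)
    sx, sy = stdev(xs), stdev(ys)
    products = [((x - mx) / sx) * ((y - my) / sy) for x, y in zip(xs, ys)]
    return sum(products) / (len(xs) - 1)


# Hypothetical communities (illustration only).
population_k = [1, 2, 5, 10, 20, 50]   # thousands of residents
churches = [2, 3, 7, 12, 25, 60]
crimes = [10, 25, 48, 110, 190, 520]

# Raw counts: both rise with population, so they rise together.
r_raw = pearson_r(churches, crimes)

# Per-capita rates: the shared dependence on population is removed.
churches_per_k = [c / p for c, p in zip(churches, population_k)]
crimes_per_k = [c / p for c, p in zip(crimes, population_k)]
r_per_capita = pearson_r(churches_per_k, crimes_per_k)

print("raw counts:      ", round(r_raw, 3))
print("per 1,000 people:", round(r_per_capita, 3))
```

In this made-up data the raw counts are almost perfectly correlated, while the per-capita rates show almost no linear association. Nothing here says churches cause crime or the reverse.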
I do not like these correlation matrices because they imply an equal significance for each
of the relationships that is most likely not justified.
We often run into that kind of error when two variables are said to be “correlated” when the speaker
actually means “dependent”.
Don’t correlate categorical variables. Use a different statistic (for ordered categories, Spearman’s
Rank Correlation Coefficient).
TI-84
Now I suppose you would like to use the calculator to do all the work for you. To find the correlation
coefficient (r) on the TI you have to prepare the calculator to report r.
If you want the TI-84 to calculate the Pearson correlation coefficient r, you must turn
“Diagnostics” ON:
2nd 0 (CATALOG) ➢ DiagnosticOn ➢ ENTER ➢ ENTER
TI-84
Let us create some data to calculate a correlation coefficient.
Enter the data from the table into two lists. For the first list,
enter the value (year - 1900), so 1920 = 20.
STAT ➢ 1:Edit ➢ Select list “L1” ➢ Enter first value “20” ➢ ENTER
Move to list “L2” ➢ Enter first value “54.1” ➢ ENTER ➢ Repeat to end of list ➢ 2nd MODE (QUIT)
TI-84
Let us look at the scatterplot
STAT PLOT ➢ L1 ➢ L2 ➢ 2nd MODE (QUIT) ➢ ZOOM ➢ 9:ZoomStat
[Figure: scatterplot of Life Expectancy (vertical axis, about 54 to 84) against Year - 1900 (horizontal axis, 20 to 100)]
TI-84
To have the TI-84 plot the points created by the two lists:
2nd Y= (STAT PLOT) ➢ ENTER ➢ On ➢ Type: scatterplot ➢ Xlist: 2nd 1 (L1) ➢ Ylist: 2nd 2 (L2) ➢ Mark ➢ ZOOM 9
To have the TI-84 calculate r:
STAT ➢ CALC ➢ 4:LinReg(ax+b) ➢ ENTER, or STAT ➢ CALC ➢ 8:LinReg(a+bx) ➢ ENTER
Xlist: 2nd 1 (L1)
Ylist: 2nd 2 (L2)
FreqList:
Store RegEQ:
Calculate ➢ ENTER
The output for LinReg(ax+b):
y=ax+b
a=.2718
b=51.65444444
r²=.9468452064
r=.9730597137
What is this telling us?
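One hedged reading of that output: r ≈ 0.973 is close to +1, so the linear association between year and life expectancy is strong and positive, and the slope a says the fitted line rises about 0.27 years of life expectancy per calendar year. The sketch below plugs the reported numbers into the line y = ax + b. The function name `predict` is made up for illustration, and the underlying data table is not reproduced here.

```python
# Values copied from the LinReg(ax+b) output shown above.
a = 0.2718            # slope: years of life expectancy per calendar year
b = 51.65444444       # intercept: fitted value when (year - 1900) = 0
r = 0.9730597137
r_squared = 0.9468452064


def predict(year):
    """Fitted life expectancy for a calendar year, using x = year - 1900."""
    return a * (year - 1900) + b


# The r² line on the calculator screen is simply r multiplied by itself.
print(abs(r ** 2 - r_squared) < 1e-8)  # True

# Fitted value for 1920 (x = 20).
print(round(predict(1920), 2))  # 57.09
```

Remember that this describes how the two variables move together. It is not a claim that the passing of years causes the change.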