Chapter 2:
Looking at Data–Relationships
Introduction
2.4 Least-Squares Regression
2.5 Cautions about Correlation and Regression
2.6 Data Analysis for Two-Way Tables
Objectives
➢ Relationships
➢ Scatterplots
➢ Correlation
✓ There is a clear association between the size of the Mocha and its price.
➢ Many interesting examples of the use of statistics involve
relationships between pairs of variables.
➢ The most useful graph for displaying the relationship between two
quantitative variables on the same individuals is a scatterplot.
Copyright© Nahid Sultana 2017-2018 12/8/2022
Interpreting Scatterplots
Linear
No relationship
Nonlinear
Correlation
The correlation coefficient "r"
Properties of Correlation
➢ r is always a number between –1 and 1.
➢ r > 0 indicates a positive association.
r < 0 indicates a negative association.
➢ Values of r near 0 indicate a very
weak linear relationship.
➢ The strength of the linear relationship
increases as r moves away from 0
toward –1 or 1.
➢ The extreme values r = –1 and r = 1
occur only in the case of a perfect
linear relationship.
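The properties above can be checked numerically. The sketch below assumes the standard sample formula for r (the average product of standardized x and y values, dividing by n − 1), which is not spelled out on these slides; the data sets are hypothetical and chosen only for illustration.

```python
import math

def correlation(xs, ys):
    """Sample correlation: sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sx = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sy = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return sum(((x - mean_x) / sx) * ((y - mean_y) / sy)
               for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data (not from the slides).
x = [1, 2, 3, 4, 5]
perfect_up = [2 * v + 1 for v in x]     # points exactly on a rising line
perfect_down = [10 - 3 * v for v in x]  # points exactly on a falling line
noisy = [2, 1, 4, 3, 5]                 # positive but imperfect association

r_up = correlation(x, perfect_up)       # close to 1
r_down = correlation(x, perfect_down)   # close to -1
r_noisy = correlation(x, noisy)         # strictly between 0 and 1
```

Only the two perfectly linear data sets reach the extreme values; the noisy one stays strictly inside the interval.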
Objectives
➢ Regression lines
➢ Least-squares regression line
➢ Facts about Least-Squares Regression
➢ Correlation and Regression
The least-squares regression line is the line that makes the sum of
the squares of the vertical distances of the data points from the
line as small as possible.
First we calculate the slope of the line: b1 = r (sy / sx)
Where
r is the correlation,
sy is the standard deviation of the response variable y,
sx is the standard deviation of the explanatory variable x.
Once we know b1, the slope, we can calculate b0, the y-intercept:
b0 = ȳ − b1 x̄
where x̄ and ȳ are the sample means of the x and y variables.
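The two formulas can be applied step by step as follows. This is a minimal sketch using Python's standard `statistics` module; the function name `least_squares_line` and the data are hypothetical, and the data were chosen to lie exactly on y = 1 + 2x so the result is easy to check.

```python
import statistics

def least_squares_line(xs, ys):
    """Return (b0, b1) using b1 = r * sy / sx and b0 = y_bar - b1 * x_bar."""
    n = len(xs)
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)  # sample std. deviations
    # Correlation r, computed from the same means and standard deviations.
    r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)
    b1 = r * sy / sx          # slope
    b0 = y_bar - b1 * x_bar   # intercept
    return b0, b1

# Hypothetical data lying exactly on y = 1 + 2x.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]
b0, b1 = least_squares_line(xs, ys)
```

Because b0 is defined from the means, the fitted line always passes through the point (x̄, ȳ).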
Example:
[Figure: two fitted line plots.
Left: Fat = 3.505 - 0.003441 NEA, with nonexercise activity (calories) on the x axis and fat gain (kilograms) on the y axis.
Right: NEA = 745.3 - 176.1 Fat, with fat gain (kilograms) on the x axis and nonexercise activity (calories) on the y axis.]
The correlation coefficient of NEA and Fat, r = -0.779, stays the same in both cases, even though the two regression lines differ.
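One way to see how the two lines relate: the slope of y on x is r·sy/sx and the slope of x on y is r·sx/sy, so their product is r². The sketch below checks this against the values quoted in the example; the small gap comes from rounding in the reported coefficients.

```python
# Values quoted from the two fitted lines in the example.
slope_fat_on_nea = -0.003441   # Fat = 3.505 - 0.003441 NEA
slope_nea_on_fat = -176.1      # NEA = 745.3 - 176.1 Fat
r = -0.779

# (r * sy / sx) * (r * sx / sy) = r**2
product = slope_fat_on_nea * slope_nea_on_fat
r_squared = r ** 2

# The second line is NOT simply the first line solved for NEA:
inverted_slope = 1 / slope_fat_on_nea   # about -290.6, not -176.1
```

Swapping the explanatory and response variables changes the regression line but not the correlation.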
BEWARE!!!
Not all calculators and software use the same convention. Some use:
ŷ = a + bx
And some use:
ŷ = ax + b
Make sure you know what YOUR calculator gives you for a and b before
you answer homework or exam questions.
Facts About Least-Squares Regression
➢ Regression line:
y-hat = 71.95 + .383 x
➢ Height at age 42 months?
y-hat = 88
➢ Height at age 30 years?
y-hat = 209.8
➢ She is predicted to be 6’10.5”
at age 30! What’s wrong?
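The arithmetic behind this example can be reproduced as follows. The sketch assumes x is age in months and y-hat is height in centimeters; the units are inferred from the 6'10.5" conversion and are not stated on the slide. The function name `predicted_height` is hypothetical.

```python
def predicted_height(age_months):
    """Regression line from the slide: y-hat = 71.95 + 0.383 x."""
    return 71.95 + 0.383 * age_months

h_42_months = predicted_height(42)        # inside a plausible childhood range
h_30_years = predicted_height(30 * 12)    # 360 months: far outside that range

# Convert the 30-year prediction to feet and inches (assumes centimeters).
feet, inches = divmod(h_30_years / 2.54, 12)
```

The line itself is computed correctly; the problem is extrapolation, that is, using it for x values far beyond the range of ages it was fitted to.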
Coefficient of determination, r²
r = -1, r² = 1
Changes in x explain 100% of the variations in y. Y can be entirely
predicted for any given value of x.

r = 0, r² = 0
Changes in x explain 0% of the variations in y. The values y takes
are entirely independent of what value x takes.

r = 0.87, r² = 0.76
Here the change in x only explains 76% of the change in y. The rest
of the change in y (the vertical scatter, shown as red arrows) must be
explained by something other than x.
r = –0.3, r² = 0.09, or 9%
The regression model explains not even 10% of the variations in y.
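The "percent of variation explained" reading of r² can be verified directly: for a least-squares line, 1 − (sum of squared residuals) / (total sum of squares about ȳ) equals r². The sketch below uses hypothetical data; only the last two lines use values quoted on the slides.

```python
import statistics

# Hypothetical data (not from the slides).
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]

n = len(xs)
x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
sx, sy = statistics.stdev(xs), statistics.stdev(ys)
r = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / ((n - 1) * sx * sy)

# Least-squares line: b1 = r * sy / sx, b0 = y_bar - b1 * x_bar.
b1 = r * sy / sx
b0 = y_bar - b1 * x_bar

residual_ss = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
total_ss = sum((y - y_bar) ** 2 for y in ys)
explained_fraction = 1 - residual_ss / total_ss   # matches r ** 2

# Values quoted on the slides.
weak = round((-0.3) ** 2, 2)     # 0.09, i.e. 9%
strong = round(0.87 ** 2, 2)     # 0.76, i.e. 76%
```

Note that the sign of r is lost when squaring: r = –0.3 and r = 0.3 both give r² = 0.09.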
Residual = observed y − predicted ŷ, the vertical distance (y − ŷ) from a data point to the regression line.
Residuals are randomly
scattered—good!
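A short sketch of how residuals are computed, using hypothetical data. The slope here is written as Sxy/Sxx, which is algebraically the same as r·sy/sx. For a least-squares line the residuals always sum to zero, so a residual plot is judged by its pattern, not by its average.

```python
# Hypothetical data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 1, 4, 3, 5]

n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Least-squares slope and intercept.
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
s_xx = sum((x - x_bar) ** 2 for x in xs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar

# Residual = observed y - predicted y-hat, one per data point.
residuals = [y - (b0 + b1 * x) for x, y in zip(xs, ys)]
residual_sum = sum(residuals)
```

Plotting `residuals` against `xs` gives the residual plot; a curved or fan-shaped pattern would suggest that a straight line is not an adequate description.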
x = hours of exercise per week
sx = 4.8
r = −0.88
Find the equation of the least-squares regression line for predicting resting
heart rate from the hours of exercise per week.
2.6 Data Analysis for Two-Way Tables
Objectives
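First factor: parent smoking status (rows)
Second factor: student smoking status (columns)

                        Student smokes   Student does not smoke
Both parents smoke           400                 1380
One parent smokes            416                 1823
Neither parent smokes        188                 1168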
This 3×2 two-way table has 3 rows and 2 columns. The numbers are counts
(frequencies).
Margins
Margins show the total for each column and each row.
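                        Student smokes   Student does not smoke   Total
Both parents smoke           400                 1380             1780
One parent smokes            416                 1823             2239
Neither parent smokes        188                 1168             1356
Total                       1004                 4371             5375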
Marginal distribution of parent smoking: the Neither row (188 + 1168 = 1356 of 5375 students) is 25.2% of the total.
The marginal distributions can be displayed on separate bar graphs, typically expressed as percents. Each graph describes one variable and ignores the second one. A marginal distribution can also be shown in a pie chart.
[Figure: bar graphs of the marginal distributions. Parent smoking: Both, One, Neither. Student smoking: Smoker, Nonsmoker.]
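The margins and marginal distributions can be computed from the six cell counts. In this sketch the row labels (Both, One, Neither) and the column order (student smokes, student does not smoke) are assumptions reconstructed from the surrounding fragments, such as the "Neither 188 1168 25.2%" row and the 400/1780 calculation.

```python
# Rows: parent smoking status. Values: (student smokes, student does not smoke).
# Row labels and column order are reconstructed assumptions.
counts = {
    "Both": (400, 1380),
    "One": (416, 1823),
    "Neither": (188, 1168),
}

# Margins: totals for each row and each column.
row_totals = {parents: sum(cells) for parents, cells in counts.items()}
col_totals = [sum(col) for col in zip(*counts.values())]
grand_total = sum(row_totals.values())

# Marginal distributions, expressed as percents of the grand total.
parent_marginal = {p: 100 * t / grand_total for p, t in row_totals.items()}
student_marginal = [100 * t / grand_total for t in col_totals]
```

Each marginal distribution sums to 100% and describes one variable on its own.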
Conditional Distribution
Counts (student smokes, student does not smoke):
Both parents smoke: 400, 1380
One parent smokes: 416, 1823
Neither parent smokes: 188, 1168

Percent of students who smoke when both parents smoke = 400/1780 = 22.5%
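The same calculation can be repeated for every row to get the conditional distribution of student smoking given parent smoking status. As before, the row labels and column order are reconstructed assumptions; only the 22.5% figure for the Both row is stated on the slide.

```python
# Rows: parent smoking status. Values: (student smokes, student does not smoke).
counts = {
    "Both": (400, 1380),
    "One": (416, 1823),
    "Neither": (188, 1168),
}

# Percent of students who smoke within each parent group:
# cell count divided by that row's total.
percent_smokers = {
    parents: 100 * smokes / (smokes + does_not)
    for parents, (smokes, does_not) in counts.items()
}
```

Comparing the three conditional percents, not the raw counts, is what reveals the association between parent and student smoking.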
Conditional distributions (Cont…)
➢ In the table below, the 25 to 34 age group occupies the first column.
29.30% = 11071 / 37785 = cell total / column total
Response            Percent
Almost no chance    194/4826 = 4.0%
Some chance         712/4826 = 14.8%
A 50-50 chance      1416/4826 = 29.3%
A good chance       1421/4826 = 29.4%
Almost certain      1083/4826 = 22.4%
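The percents in this table follow from dividing each response count by the overall total. A minimal sketch using the counts quoted above:

```python
# Response counts quoted in the table above.
totals = {
    "Almost no chance": 194,
    "Some chance": 712,
    "A 50-50 chance": 1416,
    "A good chance": 1421,
    "Almost certain": 1083,
}

n = sum(totals.values())   # overall number of respondents
percents = {response: round(100 * count / n, 1)
            for response, count in totals.items()}
```

Because each percent is rounded separately, the rounded values need not add up to exactly 100.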
Conditional Distribution
Young adults by gender and chance of getting rich

                                  Female   Male   Total
Almost no chance                     96      98     194
Some chance, but probably not       426     286     712
A 50-50 chance                      696     720    1416

1. Calculate the conditional distribution of opinion among males.
2. Examine the relationship between gender and opinion.
Counts     Accepted   Not accepted   Total
Men           198         162          360
Women          88         112          200
Total         286         274          560

Percents   Accepted   Not accepted
Men           55%         45%
Women         44%         56%
ART SCHOOL
Counts     Accepted   Not accepted   Total
Men           180          60          240
Women          64          16           80
Total         244          76          320

Percents   Accepted   Not accepted
Men           75%         25%
Women         80%         20%
Within each school a higher percentage of women were accepted than men.
Simpson’s Paradox (cont…)
Within each school a higher percentage of women were accepted than men.
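The reversal can be reproduced from the two tables of counts. The sketch below rests on an assumption that is not stated in the surviving text: that the unlabeled table (360 men, 200 women) pools all applicants, so the counts for the other school can be obtained by subtracting the art school counts. The names `pooled`, `art_school`, and `other_school` are hypothetical.

```python
def acceptance_rate(accepted, not_accepted):
    """Percent accepted among all applicants in a group."""
    return 100 * accepted / (accepted + not_accepted)

# (accepted, not accepted), copied from the two tables of counts.
pooled = {"Men": (198, 162), "Women": (88, 112)}      # ASSUMED to pool all applicants
art_school = {"Men": (180, 60), "Women": (64, 16)}

# ASSUMPTION: the other school's counts are pooled minus art school.
other_school = {
    g: (pooled[g][0] - art_school[g][0], pooled[g][1] - art_school[g][1])
    for g in pooled
}

rates = {
    name: {g: acceptance_rate(*table[g]) for g in table}
    for name, table in [("pooled", pooled),
                        ("art", art_school),
                        ("other", other_school)]
}
```

Under that assumption women have the higher acceptance rate in each school, yet men have the higher rate in the pooled table, because most men applied to the school that accepts most applicants.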