Regression and Correlation
X (Education)   Y (Income)
8               3.4
7               4.4
6               2.5
5               2.1
4               1.6
3               1.5
2               1.2
1               1.0

[Figure: scatterplot of Lifetime Earnings (criterion variable, y-axis, 0 to 5) against Education (predictor variable, x-axis, 0 to 10)]
Direction of Relationship
r = -1: perfect negative relationship
r = 0: no relationship
r = +1: perfect positive relationship
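Definitional formula:

$$ r = \frac{\text{degree to which } X \text{ and } Y \text{ vary together}}{\text{degree to which } X \text{ and } Y \text{ vary separately}} $$

$$ r = \frac{COV_{XY}}{(s_x)(s_y)}, \qquad COV_{XY} = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{n} $$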
Computational formula:

$$ r = \frac{n(\sum XY) - (\sum X)(\sum Y)}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}} $$
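As a rough illustration, the two formulas can be sketched in Python and checked against each other. The function names are made up for this sketch, the data are the education/income pairs from the scatterplot, and the definitional version assumes that s_x and s_y use the same divisor n as the covariance.

```python
from math import sqrt

def pearson_r_computational(xs, ys):
    """Pearson r from raw sums (the computational formula)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

def pearson_r_definitional(xs, ys):
    """Pearson r as covariance over the product of standard deviations.

    Assumption: covariance and both standard deviations divide by n.
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    return cov_xy / (s_x * s_y)

education = [8, 7, 6, 5, 4, 3, 2, 1]
income = [3.4, 4.4, 2.5, 2.1, 1.6, 1.5, 1.2, 1.0]

r_comp = pearson_r_computational(education, income)
r_def = pearson_r_definitional(education, income)
print(round(r_comp, 2), round(r_def, 2))  # both round to 0.9
```

The two functions should agree up to floating-point error; the computational form is simply easier to work by hand from a table of sums.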
An Example: Correlation
        X (Education)   Y (Income)   XY      X²     Y²
        8               3.4          27.2    64     11.56
        7               4.4          30.8    49     19.36
        6               2.5          15      36     6.25
        5               2.1          10.5    25     4.41
        4               1.6          6.4     16     2.56
        3               1.5          4.5     9      2.25
        2               1.2          2.4     4      1.44
        1               1            1       1      1
Sum     36              17.7         97.8    204    48.83
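$$ \sum X = 36, \quad \sum Y = 17.7, \quad \sum XY = 97.8, \quad \sum X^2 = 204, \quad \sum Y^2 = 48.83, \quad n = 8 $$

$$ r = \frac{n(\sum XY) - (\sum X)(\sum Y)}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}} $$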
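Substituting these sums into the computational formula gives the following worked check. The intermediate values are computed here rather than taken from the slides, so treat them as a verification sketch.

```latex
r = \frac{8(97.8) - (36)(17.7)}{\sqrt{\left[8(204) - (36)^2\right]\left[8(48.83) - (17.7)^2\right]}}
  = \frac{782.4 - 637.2}{\sqrt{(336)(77.35)}}
  = \frac{145.2}{\sqrt{25989.6}}
  \approx \frac{145.2}{161.21}
  \approx 0.90
```

An r of about 0.90 indicates a strong positive relationship between education and income in this sample.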
An Example: Correlation
Researchers who measure reaction time for human participants
often observe a relationship between the reaction time scores
and the number of errors that the participants commit. This
relationship is known as the speed-accuracy tradeoff. The
following data are from a reaction time study where the
researcher recorded the average reaction time (milliseconds)
and the total number of errors for each individual in a sample of
8 participants. Calculate the correlation coefficient.
[Figure: Speed Accuracy Tradeoff scatterplot; number of errors (y-axis) against Reaction Time (x-axis, 150 to 250 ms)]
        X (reaction time)   X²        Y (errors)   Y²     XY
        184                 33856     10           100    1840
        213                 45369     6            36     1278
        234                 54756     2            4      468
        197                 38809     7            49     1379
        189                 35721     13           169    2457
        221                 48841     10           100    2210
        237                 56169     4            16     948
        192                 36864     9            81     1728
Sum     1667                350385    61           555    12308
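$$ r = \frac{n(\sum XY) - (\sum X)(\sum Y)}{\sqrt{\left(n\sum X^2 - (\sum X)^2\right)\left(n\sum Y^2 - (\sum Y)^2\right)}} $$

$$ r = \frac{8(12308) - (1667)(61)}{\sqrt{\left[8(350385) - (1667)^2\right]\left[8(555) - (61)^2\right]}} = \frac{-3223}{\sqrt{(24191)(719)}} \approx -0.77 $$

The correlation is negative: participants with longer reaction times tended to commit fewer errors, which is the speed-accuracy tradeoff.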
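The arithmetic for this example can be double-checked with a short script. This is only a verification sketch; the variable names are chosen for illustration and the data are the eight (reaction time, errors) pairs from the table. The numerator comes out negative, so r is negative.

```python
from math import sqrt

reaction_time = [184, 213, 234, 197, 189, 221, 237, 192]  # X, milliseconds
errors = [10, 6, 2, 7, 13, 10, 4, 9]                       # Y, error counts

n = len(reaction_time)
sum_x = sum(reaction_time)
sum_y = sum(errors)
sum_x2 = sum(x * x for x in reaction_time)
sum_y2 = sum(y * y for y in errors)
sum_xy = sum(x * y for x, y in zip(reaction_time, errors))

# Computational formula for Pearson r
numerator = n * sum_xy - sum_x * sum_y            # 98464 - 101687
denominator = sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
r = numerator / denominator

print(sum_x, sum_x2, sum_y, sum_y2, sum_xy)  # 1667 350385 61 555 12308
print(round(r, 2))                           # -0.77
```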
Interpreting Pearson’s r
StudyHrs = average number of hours spent per week studying for 209
GPA = grade-point average earned in 209 at the end of the quarter
Step 1: Select the cell where you want your r
value to appear (you might want to label it).
Step 2: Click on the function wizard button.
Step 3: Search for and select PEARSON.
Step 4: For Array1, select all the values under StudyHrs.
For Array2, select all the values under GPA.
Step 5: That’s it! Once you have your r value,
don’t forget to round to 2 decimal places.
Knowledge check: What does the r value of 0.88 tell you about
the strength and direction of the correlation between StudyHrs
and GPA?
Scatterplots
NOTE: Your chart must be highlighted for the ‘Layout’ tab to appear under ‘Chart Tools.’
A note about x- and y-axes:
For scatterplots, it does not matter which variable goes on each axis (this is NOT true for other types of charts).
However, you need to make sure you label your axes with the proper variable name.
In this example, GPA is on the y-axis and Study Hours is on the x-axis (we can tell this based on their different ranges of values).
As a helpful hint, Excel will automatically put the first variable (left-hand column) on the x-axis, and the second variable (right-hand column) on the y-axis.
Step 6: Change the chart title by selecting it, typing a
new one, and pressing Enter. Chart and axis titles
may be altered by right-clicking on them.
Your scatterplot is now finished!
Types of Relationships
[Figure: example scatterplots of Y against X. Panels: strong relationships, weak relationships, no relationship]
Hypothetical Data

Plant   Wt (gm)   Ht (cm)
A       1         3
B       2         5
C       3         7
D       4         9

[Figure: the four plants plotted as height (y), 0 to 10, against weight (x), 0 to 10]
This is called a scatter plot.

Can we predict the height of the plant given its weight?
How tall would a 30gm plant be?
Equation of a Line
[Figure: the plant data with a straight line through the points; height (y) against weight (x)]
Note that we can draw a straight line that will contain all the points.
Note that Y refers to height and X refers to the weight.
The variables on the x and y axes are sometimes referred to as:

x-axis        y-axis
independent   dependent
predictor     predicted
carrier       response
input         output
The y-intercept is the point on the y-axis (height) where the line crosses. Therefore the y-intercept is +1.
The slope is equal to Δy / Δx, or the change in y over the change in x.
In our case, for every 1-unit increase in x, we see a 2-unit increase in y. Thus the slope is 2/1 or 2.
Note: If the line falls instead of rises from left to right, the slope is negative.
Therefore the equation of the line is

Y = 1 + 2X

where 1 is the y-intercept and 2 is the slope.

Thus the height (Y) of a 30gm plant is

Y = 1 + 2(30) = 61cm
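The same steps (slope from two points, then the intercept, then a prediction) can be sketched in a few lines of Python. The dictionary and the function name `predict_height` are illustrative only; the data are the four hypothetical plants from the table.

```python
# Hypothetical plant data from the slides: (weight in gm, height in cm)
plants = {"A": (1, 3), "B": (2, 5), "C": (3, 7), "D": (4, 9)}

# Slope = change in y over change in x, using any two points on the line
(x1, y1), (x2, y2) = plants["A"], plants["B"]
slope = (y2 - y1) / (x2 - x1)      # (5 - 3) / (2 - 1) = 2.0
intercept = y1 - slope * x1        # 3 - 2.0 * 1 = 1.0

def predict_height(weight_gm):
    """Height predicted by the line Y = intercept + slope * X."""
    return intercept + slope * weight_gm

# Every observed plant falls exactly on the line
all_on_line = all(predict_height(w) == h for w, h in plants.values())

print(slope, intercept, all_on_line)  # 2.0 1.0 True
print(predict_height(30))             # 61.0
```

Note that 30 gm lies well outside the observed range of 1 to 4 gm, so the 61 cm figure is an extrapolation of the line.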
In real life, however, we do not get such perfect data. Usually, we get something like this…
[Figure: scatter plot of height (y) against weight (x) in which the points do not all fall on one line]
How can we predict with data like this?
Residuals
[Figure: scattered data with several candidate lines drawn through it; height (y) against weight (x)]
Note that we can draw several lines that will follow the general trend of the data.
The problem is: which one best fits the data?
Each line we draw may contain some of the data points.
For those points not on the line, their vertical distances from the line are called residuals, errors or deviations.
Each line we draw also generates its own set of residuals.
The line that produces the smallest residuals overall (actually, the smallest mean squared residual) is the line that best fits the data.
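To make the "smallest squared residuals" idea concrete, here is a minimal sketch that compares a least-squares line with two other candidate lines. The noisy plant data are not given in the slides, so the sketch reuses the education/income pairs from the correlation example. The two "guess" lines are hypothetical, and the slope and intercept formulas are the standard least-squares ones rather than anything stated in the slides.

```python
education = [8, 7, 6, 5, 4, 3, 2, 1]
income = [3.4, 4.4, 2.5, 2.1, 1.6, 1.5, 1.2, 1.0]

def sum_squared_residuals(intercept, slope, xs, ys):
    """Sum of squared vertical distances between the points and a line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))

def least_squares_line(xs, ys):
    """Intercept and slope that minimise the sum of squared residuals."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return intercept, slope

b0, b1 = least_squares_line(education, income)

# Candidate lines as (intercept, slope); the two guesses are hypothetical
candidates = {
    "least squares": (b0, b1),
    "guess A": (0.0, 0.5),
    "guess B": (0.5, 0.4),
}
for name, (a, b) in candidates.items():
    print(name, round(sum_squared_residuals(a, b, education, income), 3))
```

Whatever guesses are tried, the least-squares line should report the smallest total; dividing each total by n gives the mean squared residual, which ranks the lines the same way.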
Simple Linear Regression
[Figure: scattered data with the best-fitting line drawn through it; height (y) against weight (x)]
The task of finding the equation of such a line (called a regression line) is called simple linear regression.
Test of Significance
Method 1 – test that the slope of the regression line is 0
(i.e. the regression line is horizontal). This is used to
determine if the regression line shows a statistically
significant linear relationship between X and Y.
$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$

Here $\beta_0 + \beta_1 X_i$ is the linear component and $\varepsilon_i$ is the random error component.
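One common way to carry out the slope test described in Method 1 is a t-test of the null hypothesis that β1 = 0, with n - 2 degrees of freedom. The slides do not name the test statistic, so the t-test, the 0.05 significance level and the tabled critical value below are assumptions of this sketch. The data are the education/income pairs from the correlation example.

```python
from math import sqrt

education = [8, 7, 6, 5, 4, 3, 2, 1]
income = [3.4, 4.4, 2.5, 2.1, 1.6, 1.5, 1.2, 1.0]

n = len(education)
mean_x = sum(education) / n
mean_y = sum(income) / n
s_xx = sum((x - mean_x) ** 2 for x in education)
s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(education, income))

# Least-squares estimates of the slope and intercept
b1 = s_xy / s_xx
b0 = mean_y - b1 * mean_x

# Residual variance with n - 2 degrees of freedom, then the slope's standard error
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(education, income))
standard_error_b1 = sqrt(sse / (n - 2)) / sqrt(s_xx)

t_statistic = b1 / standard_error_b1

# Assumption: two-tailed test at alpha = 0.05 with df = 6 (value from a t table)
T_CRITICAL_DF6 = 2.447
reject_h0 = abs(t_statistic) > T_CRITICAL_DF6

print(round(t_statistic, 2), reject_h0)
```

If the null hypothesis is rejected, the regression line is judged to show a statistically significant linear relationship between X and Y.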
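Simple Linear Regression Model
(continued)

$$ Y_i = \beta_0 + \beta_1 X_i + \varepsilon_i $$

[Figure: the regression line plotted as Y against X. At a given Xi the figure marks the observed value of Y for Xi, the predicted value of Y for Xi, and the random error εi for this Xi value. Slope = β1; intercept = β0.]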
Simple Linear Regression Equation
The simple linear regression equation provides an estimate of the population regression line:

$$ \hat{Y}_i = b_0 + b_1 X_i $$

where
Ŷi = estimated (or predicted) Y value for observation i
b0 = estimate of the regression intercept
b1 = estimate of the regression slope
Xi = value of X for observation i
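As a worked illustration, the estimates b0 and b1 can be computed from the same sums used for r. The least-squares formulas below are the standard ones and are not stated in the slides. The numbers are the education/income sums from the correlation example (ΣX = 36, ΣY = 17.7, ΣXY = 97.8, ΣX² = 204, n = 8).

```latex
b_1 = \frac{n(\sum XY) - (\sum X)(\sum Y)}{n\sum X^2 - (\sum X)^2}
    = \frac{8(97.8) - (36)(17.7)}{8(204) - (36)^2}
    = \frac{145.2}{336} \approx 0.432

b_0 = \bar{Y} - b_1 \bar{X}
    = \frac{17.7}{8} - 0.432 \cdot \frac{36}{8}
    \approx 2.2125 - 1.9446 \approx 0.268

\hat{Y}_i \approx 0.268 + 0.432\, X_i
```

Under these estimates, each additional unit of education is associated with a predicted increase of about 0.43 units of income.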
“A false balance is an abomination to
the Lord, but a just weight is a
delight”.
Thank you for Listening