
Chapter 11

Regression and Correlation analysis

Things you must understand after this chapter:

• Plotting a scatterplot of two continuous variables.

• Finding the equation of the linear regression line using the least squares method.

• Calculating the correlation coefficient of two continuous variables.

• Calculating the coefficient of determination of two continuous variables.

• Testing the population correlation coefficient.

Simple linear regression and correlation
• Two variables are measured simultaneously.
• The relationship between the two variables is explored/investigated.

Let x and y be two variables observed on a sample of size $n$. The data for these variables are as follows:
x: $x_1, x_2, x_3, \ldots, x_n$
y: $y_1, y_2, y_3, \ldots, y_n$

with x the independent variable and y the dependent variable.

The first step in exploring the relationship between x and y is to plot a scatterplot.

Example: Scatterplot. The table below shows the number of absences (x) in a Calculus
course and the final exam grade (y) for 7 students. Plot a scatterplot of the data:
Number of absences (x) 1 0 2 6 4 3 3
Exam grades (y) 95 90 90 55 70 80 85

[Figure: Scatterplot showing the relationship between the number of absences and final exam grades]

Comment: There is a negative linear relationship between the number of absences and
the final exam grade.
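
The scatterplot would normally be drawn with a plotting library or by hand. As a minimal text-only sketch (the function name `text_scatter` and the 5-point grade bands are illustrative assumptions, not part of the chapter), the same picture can be approximated in plain Python:

```python
def text_scatter(xs, ys, y_step=5):
    """Return text rows (highest y first) with '*' marking each data point.

    Assumes single-digit integer x values and integer y values;
    y values are grouped into bands of y_step (an illustrative choice).
    """
    x_lo, x_hi = min(xs), max(xs)
    y_hi = max(ys)
    n_rows = (y_hi - min(ys)) // y_step + 1
    grid = [["."] * (x_hi - x_lo + 1) for _ in range(n_rows)]
    for x, y in zip(xs, ys):
        grid[(y_hi - y) // y_step][x - x_lo] = "*"
    rows = [f"{y_hi - i * y_step:>3} | " + " ".join(cells)
            for i, cells in enumerate(grid)]
    # Final row: the x-axis labels, aligned under the grid columns.
    rows.append("      " + " ".join(str(x) for x in range(x_lo, x_hi + 1)))
    return rows

absences = [1, 0, 2, 6, 4, 3, 3]        # x, from the table above
grades = [95, 90, 90, 55, 70, 80, 85]   # y, from the table above
scatter_rows = text_scatter(absences, grades)
print("\n".join(scatter_rows))
```

The stars fall from the upper left to the lower right, which is the negative linear pattern described in the comment.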
Example: Scatterplot. The time x in years that an employee spent at a company and
the employee's hourly pay, y, for 9 employees are listed in the table below.

Hourly pay (y)   Time (x)
25               5
20               3
21               4
42               10
38               15
15               2
44               13
40               12
27               7

[Figure: Scatterplot showing the relationship between time spent in a company and hourly pay]

Comment: There is a positive linear relationship between the time an employee
spent at a company and hourly pay.

Regression equation (Least Squares)


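The least squares method is used to find the equation of the regression line:

$$\hat{y} = a + bx$$

where:

$$b = \frac{S_{xy}}{S_{xx}} = \frac{\sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2} \quad \text{(slope)}$$

$$a = \bar{y} - b\bar{x} \quad \text{(intercept)}$$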

Example: Regression equation
The table below shows the number of absences (x), in a Calculus course and the
final exam grade (y), for 7 students. Construct the equation of the regression line:

x    y     x^2   y^2     xy
1    95    1     9025    95
0    90    0     8100    0
2    90    4     8100    180
6    55    36    3025    330
4    70    16    4900    280
3    80    9     6400    240
3    85    9     7225    255

$\sum x = 19$, $\sum y = 565$, $\sum x^2 = 75$, $\sum y^2 = 46775$, $\sum xy = 1380$

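$$b = \frac{\sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)}{\sum x^2 - \frac{1}{n}\left(\sum x\right)^2} = \frac{1380 - \frac{1}{7}(19)(565)}{75 - \frac{1}{7}(19)^2} = -6.5549$$

$$\bar{x} = \frac{\sum x}{n} = \frac{19}{7} = 2.7143; \qquad \bar{y} = \frac{\sum y}{n} = \frac{565}{7} = 80.7143$$

$$a = \bar{y} - b\bar{x} = 80.7143 - (-6.5549)(2.7143) = 98.5061$$

$$\hat{y} = a + bx = 98.5061 - 6.5549x$$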

Estimate the final exam grade of a student who is absent for seven days:
$$\hat{y} = 98.5061 - 6.5549x = 98.5061 - 6.5549(7) = 52.6218$$
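
The hand calculation above can be checked with a short Python sketch. The function name `least_squares` is an illustrative choice. The code applies the same slope and intercept formulas to the absences data.

```python
def least_squares(xs, ys):
    """Return (a, b) for the fitted line y-hat = a + b*x."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    # S_xy and S_xx in the computational form used in this chapter
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    b = s_xy / s_xx                    # slope
    a = sum_y / n - b * (sum_x / n)    # intercept: y-bar minus b times x-bar
    return a, b

absences = [1, 0, 2, 6, 4, 3, 3]
grades = [95, 90, 90, 55, 70, 80, 85]

a, b = least_squares(absences, grades)
print(f"y-hat = {a:.4f} + ({b:.4f})x")   # y-hat = 98.5061 + (-6.5549)x

predicted = a + b * 7                    # estimate for seven absences
print(f"{predicted:.2f}")                # 52.62
```

With unrounded coefficients the estimate is about 52.62. The value 52.6218 above comes from using the coefficients rounded to four decimals.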

Plotting the regression line on the scatterplot:
[Figure: Scatterplot showing the relationship between number of absences (horizontal axis) and exam grades (vertical axis), with the regression line drawn through points A and B]

Point A: $\hat{y} = 98.5061 - 6.5549(0) = 98.5061$ -> (0; 98.5061)

Point B: $\hat{y} = 98.5061 - 6.5549(6) = 59.1767$ -> (6; 59.1767)

Correlation
The correlation coefficient (r) measures the strength and direction of the linear relationship between two variables x and y.

r = 0 – No relationship between x and y

r > 0 – Positive relationship between x and y

r < 0 – Negative relationship between x and y

Correlation coefficient guideline for interpretation:

Correlation coefficient (r)   Relationship strength
0.70 to 1.00                  Strong (+) relationship
0.40 to 0.69                  Moderate (+) relationship
0.01 to 0.39                  Weak (+) relationship
0.00                          No relationship
-0.01 to -0.39                Weak (-) relationship
-0.40 to -0.69                Moderate (-) relationship
-0.70 to -1.00                Strong (-) relationship
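
The guideline table can be expressed as a small lookup function. This is a sketch only: the name `describe_correlation` is hypothetical, and treating 0.40 and 0.70 as cut-off thresholds (so that a value such as 0.395 counts as weak) is an assumption, since the table lists ranges to two decimals only.

```python
def describe_correlation(r):
    """Map a correlation coefficient to the guideline wording in the table."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and +1")
    if r == 0:
        return "no relationship"
    size = abs(r)
    if size >= 0.70:        # assumed threshold for 'strong'
        strength = "strong"
    elif size >= 0.40:      # assumed threshold for 'moderate'
        strength = "moderate"
    else:
        strength = "weak"
    direction = "positive" if r > 0 else "negative"
    return f"{strength} {direction} relationship"

print(describe_correlation(0.55))    # moderate positive relationship
print(describe_correlation(-0.25))   # weak negative relationship
```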


Calculating the correlation coefficient:

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} \qquad \text{where } -1 \le r \le +1$$

Example: Correlation coefficient: The table below shows the number of absences (x), in a
Calculus course and the final exam grade (y), for 7 students. Calculate the correlation coefficient
of the data.

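$$r = \frac{S_{xy}}{\sqrt{S_{xx}\,S_{yy}}} = \frac{\sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)}{\sqrt{\left[\sum x^2 - \frac{1}{n}\left(\sum x\right)^2\right]\left[\sum y^2 - \frac{1}{n}\left(\sum y\right)^2\right]}} = \frac{1380 - \frac{1}{7}(19)(565)}{\sqrt{\left[75 - \frac{1}{7}(19)^2\right]\left[46775 - \frac{1}{7}(565)^2\right]}} = -0.9270$$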

Comment: There is a strong negative relationship between the number of absences (x) and the
final exam grade (y) in a Calculus course.
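
As a check on the result above, the correlation formula can be sketched in Python. The function name `correlation` is an illustrative choice, and the data are the seven (absences, grade) pairs from the example.

```python
from math import sqrt

def correlation(xs, ys):
    """Return the sample correlation coefficient r = S_xy / sqrt(S_xx * S_yy)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    s_xy = sum(x * y for x, y in zip(xs, ys)) - sum_x * sum_y / n
    s_xx = sum(x * x for x in xs) - sum_x ** 2 / n
    s_yy = sum(y * y for y in ys) - sum_y ** 2 / n
    return s_xy / sqrt(s_xx * s_yy)

absences = [1, 0, 2, 6, 4, 3, 3]
grades = [95, 90, 90, 55, 70, 80, 85]

r = correlation(absences, grades)
print(f"{r:.4f}")   # -0.9270
```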

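Calculating the coefficient of determination ($r^2$):

$$r^2 = \frac{\left[\sum xy - \frac{1}{n}\left(\sum x\right)\left(\sum y\right)\right]^2}{\left[\sum x^2 - \frac{1}{n}\left(\sum x\right)^2\right]\left[\sum y^2 - \frac{1}{n}\left(\sum y\right)^2\right]}$$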

Example: Coefficient of determination


The table below shows the number of absences (x) in a Calculus course and the
final exam grade (y) for 7 students. Calculate the coefficient of determination of the
data.

$$r = -0.9270 \;\rightarrow\; r^2 = (-0.9270)^2 = 0.8593$$


Comment: 85.93% of the total variation in the final exam grades is explained by the
number of absences in a Calculus course. 14.07% is not explained.
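
The "explained variation" reading of $r^2$ can be illustrated numerically. The sketch below fits the least squares line and compares the variation left around the line with the total variation around the mean. The function name `explained_share` and the deviation form of $S_{xy}$ and $S_{xx}$ are choices made here for the illustration. For a least squares line the result should agree with $r^2$.

```python
def explained_share(xs, ys):
    """Return 1 - (unexplained variation / total variation) for the fitted line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Deviation form of S_xy and S_xx (equivalent to the computational form)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    total = sum((y - y_bar) ** 2 for y in ys)                         # variation around the mean
    unexplained = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) # variation around the line
    return 1 - unexplained / total

absences = [1, 0, 2, 6, 4, 3, 3]
grades = [95, 90, 90, 55, 70, 80, 85]

r_squared = explained_share(absences, grades)
print(f"{r_squared:.4f}")   # 0.8593
```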

Testing the population correlation coefficient:


Step 1: Stating the hypothesis
$$H_0: \rho = 0$$
$$H_1: \rho \ne 0$$

Step 2: Level of significance
$\alpha$ is given.

Step 3: Test statistic

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}}, \qquad \text{where } r \text{ is the correlation coefficient from the sample}$$

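Step 4: Critical values (t-distribution with $n - 2$ degrees of freedom)

$$t_{n-2;\,\alpha/2} \quad \text{and} \quad -t_{n-2;\,\alpha/2}$$

Step 5: Decision and conclusion

If $t \le -t_{n-2;\,\alpha/2}$ or $t \ge t_{n-2;\,\alpha/2}$, reject $H_0$: there is a significant correlation.
If $-t_{n-2;\,\alpha/2} < t < t_{n-2;\,\alpha/2}$, do not reject $H_0$: there is no significant correlation.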


Example: Testing the population correlation coefficient


The table below shows the number of absences (x), in a Calculus course and the
final exam grade (y), for 7 students. Test at 5% significance level if there is a
significant correlation between the number of absences (x) and the final exam
grade.

Step 1: Stating the hypothesis

$$H_0: \rho = 0$$
$$H_1: \rho \ne 0$$

Step 2: Level of significance

$$\alpha = 0.05$$

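Step 3: Test statistic

$$t = \frac{r}{\sqrt{\dfrac{1 - r^2}{n - 2}}} = \frac{-0.9270}{\sqrt{\dfrac{1 - (-0.9270)^2}{7 - 2}}} = -5.53$$

Step 4: Critical values (t-distribution)

$$t_{5;\,0.025} = 2.571 \quad \text{and} \quad -t_{5;\,0.025} = -2.571$$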

Step 5: Decision and conclusion

Since $t = -5.53 \le -2.571$, we reject $H_0$ and conclude that, at a 5% level of
significance, there is a significant correlation between the number of
absences and the final exam grade.
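
The steps of this test can be sketched in Python. The function name `correlation_t_test` is hypothetical. The critical value is passed in from a t-table (2.571 for 5 degrees of freedom at the two-sided 5% level, as in Step 4), because the Python standard library has no t-distribution quantile function.

```python
from math import sqrt

def correlation_t_test(r, n, t_critical):
    """Two-sided test of H0: rho = 0. Returns (t, reject_h0).

    t_critical is the positive table value for n - 2 degrees of freedom
    and alpha/2, supplied by the caller.
    """
    t = r / sqrt((1 - r ** 2) / (n - 2))   # Step 3: test statistic
    reject_h0 = abs(t) >= t_critical       # Step 5: t <= -crit or t >= crit
    return t, reject_h0

t, reject_h0 = correlation_t_test(r=-0.9270, n=7, t_critical=2.571)
print(f"t = {t:.2f}")                      # t = -5.53
print("reject H0" if reject_h0 else "do not reject H0")   # reject H0
```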


Exercise
A data analyst working for a transportation company has been tasked with
studying the relationship between the number of hours a delivery truck is on the
road and the total number of deliveries it makes in a day. Data is collected from 8
different days:
Hours on Road (X)   Number of Deliveries (Y)
5                   25
9                   35
8                   32
7                   30
4                   22
10                  38
3                   20
6                   27

1.1 Plot a scatterplot between the number of hours on the road and the number of deliveries.

1.2 Calculate the regression equation using the least squares method to model the relationship
between hours on the road and the number of deliveries. What can you conclude about the
slope?

1.3 Estimate the number of deliveries for a delivery truck that is on the road for twelve hours.

1.4 Calculate and interpret the correlation coefficient to determine the strength and the
direction of the relationship between these two variables.

1.5 Calculate and interpret the coefficient of determination between these two variables.

1.6 Perform a hypothesis test to determine whether the correlation coefficient between hours
on the road and the number of deliveries is statistically significant. Use a significance level
of 0.05.
