
Learning Guide Module

Subject Code: Stat1 (Introduction to Statistics)
Module Code: 10.0 Linear Regression, Part I
Lesson Code: 10.3 Regression Line (Part 2); Association vs. Causation
Time Limit: 30 minutes

TARGET
At the end of this lesson, the learner is expected to:
✓ predict the response variable 𝑦 given a value of the explanatory variable 𝑥 using the regression line; and
✓ explain why association does not imply a cause-and-effect relationship.

HOOK
The previous lesson taught you how to compute the coefficients of the regression line and how to interpret them. With the use of technology, you can easily find these values and derive the equation. This equation can then be used to predict values of the response variable for given values of the explanatory variable, which is what this lesson covers.
Also, you have learned that the strength of the linear relationship between two quantitative variables is measured by the correlation coefficient. But have you ever wondered why some variables are highly correlated even when the relationship does not make any sense? Have you heard about a high correlation between drowning incidents and ice cream sales? Or a high correlation between civil engineering doctorates awarded and per capita consumption of mozzarella cheese? These questions will be addressed in this lesson.

IGNITE
Example 1: The owner of a milk tea shop found that there is a linear correlation between the daily atmospheric temperature and the total number of sales during the summer season. A random sample of 15 days was selected, with the results given below. If 𝑟 = 0.931, find the regression line equation.

Day                  1    2    3    4    5    6    7    8    9    10   11   12   13   14   15
Temp (℉)             79   76   78   84   90   83   93   94   97   85   88   82   83   95   80
Total Sales (Units)  147  143  147  168  206  155  192  211  209  187  200  150  160  200  153

We were already able to identify the corresponding regression line equation from the previous
lesson:
𝑇𝑜𝑡𝑎𝑙 𝑆𝑎𝑙𝑒𝑠̂ = −136.294 + 3.630 × 𝑇𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒
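If you would like to reproduce these coefficients with software rather than by hand, here is a minimal Python sketch (for illustration only; the module itself suggests a calculator or MS Excel):

```python
import numpy as np

# Data from Example 1
temp = np.array([79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82, 83, 95, 80])
sales = np.array([147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150, 160, 200, 153])

# Least-squares fit of degree 1: np.polyfit returns (slope, intercept)
b1, b0 = np.polyfit(temp, sales, deg=1)
print(f"Total Sales = {b0:.3f} + {b1:.3f} * Temperature")
# Prints approximately: Total Sales = -136.294 + 3.630 * Temperature,
# which agrees with the regression line equation above.
```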
Now, we can use this equation to predict the total number of sales given the daily atmospheric temperature. Say the atmospheric temperature is 87℉; what is the predicted total number of sales? We simply substitute 87 for 𝑇𝑒𝑚𝑝𝑒𝑟𝑎𝑡𝑢𝑟𝑒 in the regression equation.

𝑇𝑜𝑡𝑎𝑙 𝑆𝑎𝑙𝑒𝑠̂ = −136.294 + (3.630)(87) = 179.516 ≈ 180 𝑢𝑛𝑖𝑡𝑠
Based on our linear model, we predict total sales of about 180 units when the atmospheric temperature is 87℉.
If the atmospheric temperature is 90℉, what are the predicted total sales? If the actual sales for that day are 200 units, what is the value of the residual? Did we overestimate or underestimate the total sales for that day?
Again, we just need to substitute the atmospheric temperature to get the predicted value.
𝑇𝑜𝑡𝑎𝑙 𝑆𝑎𝑙𝑒𝑠̂ = −136.294 + (3.630)(90) = 190.406 ≈ 190 𝑢𝑛𝑖𝑡𝑠
The predicted total sales are about 190 units when the atmospheric temperature is 90℉. To compute the residual, we subtract the predicted value from the observed value:
𝑒 = 𝑦 − 𝑦̂ = 200 − 190.406 = 9.594 ≈ 10 𝑢𝑛𝑖𝑡𝑠
From the value of this residual, we can say that we underestimated the total number of sales for that day by about 10 units.
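The same substitute-and-subtract steps can also be written as a short Python sketch (an illustration only; the helper function name is mine, and the coefficients are those of the regression line above):

```python
# Predict total sales from temperature using the fitted line, then compute a residual.
B0, B1 = -136.294, 3.630  # intercept and slope of the regression line above

def predict_sales(temp_f: float) -> float:
    """Predicted total sales (units) for a given temperature in degrees Fahrenheit."""
    return B0 + B1 * temp_f

print(predict_sales(87))   # ~179.516 -> about 180 units
print(predict_sales(90))   # ~190.406 -> about 190 units

# Residual = observed - predicted; a positive residual means we underestimated.
residual = 200 - predict_sales(90)
print(residual)            # ~9.594 -> about 10 units
```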
Indeed, the regression line can help us make predictions, provided that there is a linear relationship between the two quantitative variables. Remember that regression analysis is only applicable when its conditions are met. We should not use it when the relationship is nonlinear. We should also not ignore the presence of outliers, as they can greatly affect the regression line. Lastly, we should be careful in identifying the response and explanatory variables in our study; otherwise, our regression equation might be inverted.
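One quick way to check these conditions is to plot the data together with the fitted line and look at the pattern before trusting any prediction. A minimal sketch, assuming matplotlib is available:

```python
import numpy as np
import matplotlib.pyplot as plt

temp = np.array([79, 76, 78, 84, 90, 83, 93, 94, 97, 85, 88, 82, 83, 95, 80])
sales = np.array([147, 143, 147, 168, 206, 155, 192, 211, 209, 187, 200, 150, 160, 200, 153])

# Scatterplot of the data with the fitted regression line overlaid;
# look for a roughly linear pattern and for points far from the line (possible outliers).
plt.scatter(temp, sales)
xs = np.linspace(temp.min(), temp.max(), 100)
plt.plot(xs, -136.294 + 3.630 * xs, color="red")
plt.xlabel("Temperature (°F)")
plt.ylabel("Total Sales (units)")
plt.show()
```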

To facilitate the discussion on causation, please watch the videos linked below:

https://www.khanacademy.org/partner-content/wi-phi/wiphi-critical-thinking/wiphi-
fundamentals/v/critical-thinking-fundamentals-correlation-and-causation

https://www.youtube.com/watch?v=gxSUqr3ouYA

“Just because there is a correlation between two variables doesn’t mean that one causes the other.” This statement summarizes the main concern of both videos. We tend to see causal relationships everywhere, even when there are other underlying explanations for these phenomena. Sometimes the relationship may be purely coincidental, or it might be driven by a third variable. This third variable is what we call a lurking variable.
For example, the number of drowning incidents and ice cream sales might be strongly correlated, but that does not mean that eating ice cream contributes to the number of people drowning, or vice versa. What do these variables have in common that we might be overlooking? A possible lurking variable is the temperature or humidity of the day. When it is hot or humid, people tend to crave cold treats like ice cream or decide to go swimming. Hence, the number of people going swimming and the number buying ice cream are highly correlated with each other.
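To see how a lurking variable can manufacture a strong correlation, consider this small simulation with entirely made-up numbers (an illustration only, not real data): temperature drives both quantities, yet neither causes the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lurking variable: daily temperature (in °C) over 200 days
temperature = rng.normal(loc=30, scale=5, size=200)

# Both quantities depend on temperature (plus noise), but not on each other.
ice_cream_sales = 10 * temperature + rng.normal(scale=20, size=200)
drowning_incidents = 0.3 * temperature + rng.normal(scale=1.5, size=200)

# A strong correlation appears even though neither variable causes the other.
r = np.corrcoef(ice_cream_sales, drowning_incidents)[0, 1]
print(f"r = {r:.2f}")  # typically around 0.6-0.7 with these made-up settings
```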
Another example would be a high correlation between the grades of high school students and their average length of sleep on school days. Yes, we could say that the length of sleep might contribute to academic performance, but there are many other factors that might affect one’s performance in class, such as IQ, hours of study, and study habits. This is an example of a correlation that may involve a complex relationship among many other variables. What we should always keep in mind is that association does not imply causation.

The correlation coefficient and regression analysis should be used carefully. It can be misleading to draw conclusions from these results without considering the possibility of lurking variables. So the next time you see strongly correlated variables, think about whether a direct cause-and-effect relationship really exists, whether the association is driven by a lurking variable, or whether the correlation is purely coincidental.

NAVIGATE
It is now your turn to apply the things we have discussed. Follow the instructions below.
The following items are based on your textbook exercises (Intro Stats by De Veaux, Velleman, & Bock). You may use a calculator or software (MS Excel) in answering them. You may write your answers on a clean sheet of paper following the usual format, or you may type your answers in a word document. For items involving computations, show your complete solutions and box your final answers.
Note: The first item is an exercise (non-graded), while the second item is the quiz (graded).
1. The data below shows the number of salespeople working in a store and the sales they generate
for the day. The correlation coefficient is 𝑟 = 0.965.
Number of Salespeople Working   2   3   7   9   10   10   12   15   16   20
Sales (in Php1000)              50  55  65  70  90   100  100  110  110  130

a) Identify the appropriate response and explanatory variables in this study. Explain your
answer briefly.
b) Find the slope estimate, 𝑏1 .
c) Interpret the value of the slope in this context.
d) Find the intercept, 𝑏0 .
e) What does the intercept mean given the context of the variables? Is it meaningful?
f) If 18 salespeople are working, what is the predicted sales?
g) If the actual sales when 18 salespeople are working are Php125,000, what is the value of
the residual? Is it an overestimate or an underestimate?

2. Some data on hard drives are given. You want to predict the selling price of the hard drive from
its capacity. The summary statistics for the variables are given below.

                    Mean     Standard Deviation
Price (in Php)      6,700    7,563
Capacity (in TB)    1.110    1.4469
𝑟 = 0.994

a) Find the slope estimate, 𝑏1 .
b) Interpret the value of the slope in this context.
c) Find the intercept, 𝑏0 .
d) What does the intercept mean given the context of the variables? Is it meaningful?
e) Write the regression line equation that predicts the selling price from the capacity of the
hard drive.
f) What would you predict for the price of a 3.0 TB hard drive?

KNOT
In summary, this module discussed important concepts involving linear regression.
✓ In regression analysis, it is important to identify the roles of the variables.
✓ The response variable is the one we hope to predict using the linear model, while the explanatory variable tries to explain the changes in the response variable.
✓ Linear regression is used for two quantitative variables with a linear relationship.
✓ We should be on the lookout for possible outliers in the data set, as they can affect the regression line.
✓ 𝑅𝑒𝑠𝑖𝑑𝑢𝑎𝑙 = 𝑂𝑏𝑠𝑒𝑟𝑣𝑒𝑑 𝑣𝑎𝑙𝑢𝑒 − 𝑃𝑟𝑒𝑑𝑖𝑐𝑡𝑒𝑑 𝑣𝑎𝑙𝑢𝑒
✓ Linear model:
𝑦̂ = 𝑏0 + 𝑏1 𝑥
✓ Formula for the slope (applied in the short sketch after this summary):
𝑏1 = 𝑟 (𝑠𝑦 / 𝑠𝑥 )
✓ Formula for the intercept:
𝑏0 = 𝑦̅ − 𝑏1 𝑥̅

✓ The regression line can be used to predict the response variable given a value of the explanatory
variable.
✓ The Data Analysis Toolpak in MS Excel can easily generate summary statistics and the coefficients
for regression analysis.
✓ Association ≠ Causation. Some correlations may be caused by a lurking variable or a more complex
relationship among other variables.
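
As referenced in the slope bullet above, here is a minimal Python sketch (with made-up numbers, not the module's data) showing how the slope and intercept formulas in this summary are applied, cross-checked against a direct least-squares fit:

```python
import numpy as np

# Made-up sample data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

r = np.corrcoef(x, y)[0, 1]             # correlation coefficient
sx, sy = x.std(ddof=1), y.std(ddof=1)   # sample standard deviations
x_bar, y_bar = x.mean(), y.mean()

b1 = r * sy / sx          # slope:     b1 = r * (sy / sx)
b0 = y_bar - b1 * x_bar   # intercept: b0 = y_bar - b1 * x_bar

print(b1, b0)                   # approximately 1.96 and 0.14
print(np.polyfit(x, y, deg=1))  # the same values from a direct least-squares fit
```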

References:
1. Bluman, A. G. (2014). Elementary Statistics: A Step by Step Approach (9th ed.). McGraw-Hill.
2. De Veaux, R. D., Velleman, P. F., & Bock, D. E. (2014). Intro Stats (New International ed.). Great Britain: Pearson Education Limited.

Prepared by: Mark Louvelle Parulan (SST I, PSHS – MC)
Reviewed by: Myrna B. Libutaque (SST V, PSHS – WVC)

© 2020 Philippine Science High School System. All rights reserved. This document may contain proprietary information and may only be
released to third parties with approval of management. Document is uncontrolled unless otherwise marked; uncontrolled documents
are not subject to update notification.

