Lecture 12 - Advanced Correlation and Multiple Regression


Linear Regression

Lecture 12
INSTRUCTOR:
DR. MAHA AMIN HASSANEIN
PROFESSOR, ENGINEERING MATHEMATICS AND PHYSICS DEPARTMENT
FACULTY OF ENGINEERING
CAIRO UNIVERSITY
Study Outline
Multiple Linear Regression
Variance-Covariance Matrix
R² Goodness of Fit



Multiple Linear Regression
Making predictions about an outcome Y based on multiple factors X
Y: the dependent variable
(X1, X2, …, Xk): the independent variables
Regression coefficients:
β0: the intercept
(β1, β2, …, βk): the slopes, which give the strength and direction of each relationship



Multiple Linear Regression
Under the same assumptions as in simple linear regression, the multiple linear regression equation is:

Y = β0 + β1X1 + β2X2 + … + βkXk + ε

where ε represents the error term (the unexplained variation in Y).



Illustrative Example
Data: Female (female/male indicator), Wage (yearly wage), Age (in years), Educ (education level 1 to 4), Parttime (part-time job indicator: 1 if part-time, 0 if full-time).
The linear relationship between the response (Wage) and the predictor variables is given by:

yi = β0 + β1x1i + β2x2i + β3x3i + β4x4i + ei

Dependent:
y = Wage
Independent:
x1 = Female, x2 = Age, x3 = Educ, x4 = Parttime
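
As a sketch of how this model could be fit in R (the data frame name wages and its columns are assumptions for illustration; no data set is provided with the slides):

# Hypothetical data frame 'wages' with columns Wage, Female, Age, Educ, Parttime
fit <- lm(Wage ~ Female + Age + Educ + Parttime, data = wages)
summary(fit)   # estimates b0, ..., b4 with standard errors and R-squared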



Multiple Linear Regression
Model (In Matrix Form)
For n observations and k factors

Y = Xβ + ε

Y is an n × 1 vector of responses
X is the n × (k+1) design matrix
β is a (k+1) × 1 vector of unknown parameters
ε is an n × 1 vector of error terms

The estimate of the parameter β is denoted by β̂ = b.



Least Squares Solution
The least squares solution of
y = Xb
is the vector b that minimizes the residual error
‖y − Xb‖²
The normal equations:
(XᵀX) b = Xᵀy
If X is of full rank, that is rank(X) = k + 1, then XᵀX is non-singular and

b = (XᵀX)⁻¹ Xᵀy
(prove)
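
As an illustrative sketch (not part of the original slides), the normal equations can be solved directly in R; here with the data of Example 1 below:

# Solve (X'X) b = X'y for a straight-line fit
x <- c(0, 1, 2, 3, 4)
y <- c(8, 9, 4, 3, 1)
X <- cbind(1, x)                     # design matrix with intercept column
b <- solve(t(X) %*% X, t(X) %*% y)   # least squares estimates
b                                    # returns 9 and -2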



Sum of Squares Error (SSE)
An estimate ŷ = Xb has sum of squares error

SSE = Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

In matrix form this is rewritten as

SSE = (y − ŷ)ᵀ(y − ŷ) = (y − Xb)ᵀ(y − Xb)



Estimate of σ²
An estimate of the residual variance σ² in ε ~ N(0, σ²), based on the residual sum of squares, is

σ̂² = sₑ² = (1 / (n − (k+1))) Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)²

or, rewritten in terms of SSE,

sₑ² = SSE / (n − (k+1))

with degrees of freedom
df = n − (number of β's in the model) = n − (k+1)

The standard error of the estimate: sₑ = √(sₑ²)



Example 1 (case k = 1)
Use the matrix relations to fit a straight line to the data

X 0 1 2 3 4
y 8 9 4 3 1



Solution
Xᵀ = | 1 1 1 1 1 |     y = (8, 9, 4, 3, 1)ᵀ
     | 0 1 2 3 4 |

XᵀX = |  5  10 |     (XᵀX)⁻¹ = |  0.6  −0.2 |     Xᵀy = | 25 |
      | 10  30 |               | −0.2   0.1 |           | 30 |

Compute b = (XᵀX)⁻¹ Xᵀy:

b = |  0.6  −0.2 | | 25 |  =  |  9 |
    | −0.2   0.1 | | 30 |     | −2 |

The fitted equation: ŷ = 9 − 2x



Cont'd.
The fitted values: ŷ = Xb

     | 1 0 |            | 9 |
     | 1 1 |   |  9 |   | 7 |
ŷ =  | 1 2 |   | −2 | = | 5 |
     | 1 3 |            | 3 |
     | 1 4 |            | 1 |

Thus

y − ŷ = (−1, 2, −1, 0, 0)ᵀ

→ SSE = 6 and sₑ² = 6 / (5 − (1+1)) = 2.0
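
The same numbers can be checked with lm() in R (a sketch):

x <- c(0, 1, 2, 3, 4)
y <- c(8, 9, 4, 3, 1)
fit <- lm(y ~ x)
coef(fit)                              # intercept 9, slope -2
sum(resid(fit)^2)                      # SSE = 6
sum(resid(fit)^2) / df.residual(fit)   # s_e^2 = 6/3 = 2.0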



Case k = 2: Two predictor variables
Y = β0 + β1X1 + β2X2

The fitted values are ŷ = Xb with

    | 1  x11  x12 |
X = | 1  x21  x22 |,    b = (b0, b1, b2)ᵀ
    | ⋮    ⋮    ⋮  |
    | 1  xn1  xn2 |

       | n    Σx1    Σx2   |           | Σy   |
XᵀX = | Σx1  Σx1²   Σx1x2 |,   Xᵀy = | Σx1y |
       | Σx2  Σx2x1  Σx2²  |           | Σx2y |

b̂ = (XᵀX)⁻¹ Xᵀy

sₑ² = (1 / (n − 3)) (y − Xb)ᵀ(y − Xb)



Example (case k = 2)
Data: y = number of twists required to break an alloy bar, x1 = % of element A in the bar, x2 = % of element B in the bar.
Fit a least squares regression plane and use it to estimate the number of twists required to break a bar with x1 = 2.5, x2 = 12.

y   41  42  69  40  50  43
x1   1   2   3   1   2   4
x2   5   5   5  10  10  20



Solution
     | 1 1 1  1  1  1 |
Xᵀ = | 1 2 3  1  2  4 |,    y = (41, 42, 69, 40, 50, 43)ᵀ
     | 5 5 5 10 10 20 |

       |  6   13   55 |               |  0.915  −0.244  −0.024 |          |  285 |
XᵀX = | 13   35  140 |,   (XᵀX)⁻¹ ≈ | −0.244   0.233  −0.028 |,  Xᵀy = |  644 |
       | 55  140  675 |               | −0.024  −0.028   0.009 |          | 2520 |

Compute b = (XᵀX)⁻¹ Xᵀy:

    | 43.24  |
b = |  8.8   |
    | −1.615 |

The fitted equation: ŷ = 43.24 + 8.8x1 − 1.615x2

At x1 = 2.5, x2 = 12: ŷ = 43.24 + 8.8(2.5) − 1.615(12) ≈ 45.9 twists.
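
As a cross-check (a sketch in R, not from the slides), fitting the same data and computing the requested estimate:

y  <- c(41, 42, 69, 40, 50, 43)
x1 <- c(1, 2, 3, 1, 2, 4)
x2 <- c(5, 5, 5, 10, 10, 20)
fit <- lm(y ~ x1 + x2)
coef(fit)    # approximately 43.24, 8.80, -1.615
predict(fit, newdata = data.frame(x1 = 2.5, x2 = 12))   # about 45.9 twists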



Variance-Covariance Matrix
The estimated variances and covariances of the least squares estimators are expressed as follows (assuming the errors eᵢ have zero mean, constant variance E(eᵢ²) = σ², and E(eᵢeⱼ) = 0 for i ≠ j):

               | Var(b0)      Cov(b0, b1)  ⋯  Cov(b0, bk) |
sₑ² (XᵀX)⁻¹ = | Cov(b1, b0)  Var(b1)      ⋯  Cov(b1, bk) |
               | ⋮            ⋮            ⋱  ⋮           |
               | Cov(bk, b0)  Cov(bk, b1)  ⋯  Var(bk)     |

Let (XᵀX)⁻¹ = C. Then

σ̂²(bi) = Var(bi) = E[(bi − E(bi))²] = cii sₑ²
σ̂(bi, bj) = Cov(bi, bj) = E[(bi − E(bi))(bj − E(bj))] = cij sₑ²



Example 1.
Variance-Covariance Matrix
The variance-covariance matrix:

C = sₑ² (XᵀX)⁻¹ = 2.0 |  0.6  −0.2 |
                      | −0.2   0.1 |

The estimated variances and covariance:
σ̂²(b0) = Var(b0) = 2.0 × c11 = 2.0 × 0.6 = 1.2
σ̂²(b1) = Var(b1) = 2.0 × c22 = 2.0 × 0.1 = 0.2
Cov(b1, b0) = 2.0 × (−0.2) = −0.4
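
In R, the estimated variance-covariance matrix sₑ²(XᵀX)⁻¹ of the coefficients comes directly from vcov(); a sketch for Example 1:

x <- c(0, 1, 2, 3, 4)
y <- c(8, 9, 4, 3, 1)
fit <- lm(y ~ x)
vcov(fit)    # |  1.2  -0.4 |
             # | -0.4   0.2 |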



*Recall: Estimates α̂ and β̂
As α̂ = a and β̂ = b are linear functions of independent normal variables, they are random variables and normally distributed:

a ~ N(α, σa²) and b ~ N(β, σb²)

μa = α,   σa² = sₑ² (1/n + x̄²/Sxx) = sₑ² Σx² / (n Sxx)

μb = β,   σb² = sₑ² / Sxx



Example 1. CI of regression parameters
For 95% confidence with df = 5 − 2 = 3, tα/2 = 3.182

CI for the intercept:
a ∈ 9 ± 3.182 √(2.0 × 0.6)
CI for the slope:
b ∈ −2 ± 3.182 √(2.0 × 0.1)



Estimates β̂i
As β̂i = bi are linear functions of independent normal variables, they are random variables and normally distributed. Thus, the (1 − α)·100% CI for the parameters is

β̂i ± t(α/2, n−(k+1)) sₑ √cii

(prove)
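
In R, confint() produces these t-intervals from a fitted model; a sketch, again with Example 1:

x <- c(0, 1, 2, 3, 4)
y <- c(8, 9, 4, 3, 1)
fit <- lm(y ~ x)
confint(fit, level = 0.95)   # b_i -+ t(alpha/2, n-(k+1)) * s_e * sqrt(c_ii)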



R-squared
Measure of Quality of Fit
R-squared, also known as the coefficient of determination, measures how well the linear regression model fits the data. It ranges from 0 to 1.
It is the proportion of the variability in y explained by the linear model:

R² = (sum of squares due to regression) / (total sum of squares)
   = (SST − SSE) / SST = (Sxy² / Sxx) / Syy = Sxy² / (Sxx Syy) = r²



Recall: Correlation Coefficient r
The correlation coefficient is the sum of the products of the standardized observations:

r = (1/(n−1)) Σ [(xᵢ − x̄)/sx] [(yᵢ − ȳ)/sy]

Since

sx² = Σ(xᵢ − x̄)² / (n−1) = Sxx / (n−1),

a simple formula for r is

r = Sxy / √(Sxx Syy)



Values of R-squared
If SSE = 0, then r² = 1: the fit is perfect.
If SSE ≈ SST, then r² ≈ 0: a poor fit.

As a rough guide:
R-squared between 0 and 0.3: weak fit
R-squared between 0.3 and 0.7: moderate fit
R-squared above 0.7: strong fit



Example 3
The following are the numbers of minutes to complete a task in the morning, x, and in the late afternoon, y. Calculate the sample correlation coefficient.

x  11.1 10.3 12.0 15.1 13.7 18.5 17.3 14.2 14.8 15.3
y  10.9 14.2 13.8 21.5 13.2 21.1 16.4 19.3 17.4 19.0
Solution
The sums: Σx = 142.3, Σy = 166.8, Σx² = 2085.31, Σxy = 2434.69, Σy² = 2897.80

Sxx = 2085.31 − (142.3)²/10 = 60.381

Sxy = 2434.69 − (142.3)(166.8)/10 = 61.126

Syy = 2897.80 − (166.8)²/10 = 115.576

Hence, r = 61.126 / √(60.381 × 115.576) = 0.732
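
As a quick cross-check in R (a sketch with the ten data pairs):

x <- c(11.1, 10.3, 12.0, 15.1, 13.7, 18.5, 17.3, 14.2, 14.8, 15.3)
y <- c(10.9, 14.2, 13.8, 21.5, 13.2, 21.1, 16.4, 19.3, 17.4, 19.0)
cor(x, y)    # approximately 0.732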
Sol. Cont'd
Since r = 0.732, the proportion of variation in y attributed to x is

r² = (0.732)² = 0.536

That is, about 54% of the variability in the afternoon times is accounted for by a linear relationship with the morning times in this sample.



Summary of Boston Results
Call:
lm(formula = medv ~ lstat, data = Boston)

Residuals:
     Min       1Q   Median       3Q      Max
 -15.168   -3.990   -1.318    2.034   24.500

Coefficients:
             Estimate  Std. Error  t value  Pr(>|t|)
(Intercept)  34.55384     0.56263    61.41    <2e-16
lstat        -0.95005     0.03873   -24.53    <2e-16

Residual standard error: 6.216 on 504 degrees of freedom
Multiple R-squared: 0.5441,  Adjusted R-squared: 0.5432
F-statistic: 601.6 on 1 and 504 DF,  p-value: < 2.2e-16
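
For reference, this output can be reproduced in R; the Boston data set ships with the MASS package (a sketch):

library(MASS)    # provides the Boston data set (medv, lstat, ...)
fit1 <- lm(medv ~ lstat, data = Boston)
summary(fit1)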



*Example 4 using R
data(mtcars)
str(mtcars)     # structure of the data set
head(mtcars)    # first six rows

model <- lm(mpg ~ wt, data = mtcars)    # regress mpg on weight
summary(model)

plot(mtcars$wt, mtcars$mpg, xlab = "Weight",
     ylab = "Miles per Gallon",
     main = "Scatter Plot of MPG vs. Weight")
abline(model, col = "red")              # add the fitted regression line

t.test(mpg ~ am, data = mtcars)         # compare mpg by transmission type
cor(mtcars$mpg, mtcars$hp)              # correlation of mpg with horsepower

Ans. cor = -0.7761684



Cnt’d
>summary(model)
Call: lm(formula = mpg ~ wt, data = mtcars)

Residuals:
    Min      1Q  Median      3Q     Max
-4.5432 -2.3647 -0.1252  1.4096  6.8727

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
wt           -5.3445     0.5591  -9.559 1.29e-10 ***

Residual standard error: 3.046 on 30 degrees of freedom
Multiple R-squared: 0.7528,  Adjusted R-squared: 0.7446
F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10



Cnt’d.
Create a scatter plot with regression line and confidence interval:

library(ggplot2)
scatter_plot <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  geom_smooth(method = "lm", se = TRUE, color = "red") +   # fitted line with CI band
  labs(x = "Weight", y = "Miles per Gallon",
       title = "Scatter Plot of MPG vs. Weight") +
  theme_minimal()
scatter_plot



Textbook
Chapter 11, Sec. 11.1 - 11.7




Thank you for your attention

Maha

