Simple Linear Regression
Branch                           1     2     3     4     5     6     7     8     9    10
Student Population (1000s)       2     6     8     8    12    16    20    20    22    26
Quarterly Sales ($1000s)        58   105    88   118   117   137   157   169   149   202
Scatter plot of quarterly sales versus student population
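As a minimal sketch (assuming matplotlib is available), the scatter plot of the example data can be drawn as follows:

    # Minimal sketch: scatter plot of quarterly sales vs. student population
    import matplotlib.pyplot as plt

    x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]               # student population (1000s)
    y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]   # quarterly sales ($1000s)

    plt.scatter(x, y)
    plt.xlabel("Student population (1000s)")
    plt.ylabel("Quarterly sales ($1000s)")
    plt.title("Sales vs. population")
    plt.show()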
First-Order Linear Model
The first-order linear model is also called simple linear regression.
There is only one independent variable, and we assume it to be linearly related to the dependent variable.
The model is
Y = β0 + β1·x + ε
where β0 is the intercept, β1 is the slope, and ε is the random error term.
Mean model: E(Y | x) = β0 + β1·x
Note that a “linear model” is linear in the parameters β0 and β1.
If the relationship between x and y appears to be quadratic or to follow some other pattern, we can still build a linear model, such as
Y = β0 + β1·x² + ε   or   Y = β0 + β1·e^x + ε, etc.
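To illustrate “linear in the parameters”, the sketch below fits Y = β0 + β1·x² + ε by regressing y on the transformed predictor x². The data arrays here are hypothetical placeholders, not part of the example above:

    # Illustration: curved in x, but still linear in the parameters beta0, beta1
    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])        # hypothetical predictor values
    y = np.array([2.1, 4.9, 10.2, 17.1, 26.3])     # roughly quadratic in x

    # Regress y on the transformed predictor x^2: design columns are [1, x^2]
    X = np.column_stack([np.ones_like(x), x**2])
    b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
    print(b0, b1)                                  # least squares estimates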
Again, we would like to estimate the population mean β0 + β1·x from a sample.
We use least squares as the criterion and find the values b0 and b1 that minimize the function
F(b0, b1) = Σᵢ₌₁ⁿ [yᵢ − (b0 + b1·xᵢ)]²
β0 and β1 are the parameters of the model; b0 and b1 are their estimators.
Estimating the coefficients
Least squares estimation: ŷᵢ = b0 + b1·xᵢ
Least Squares Line Coefficients

b1 = Σᵢ₌₁ⁿ (xᵢ − x̄)(yᵢ − ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²
   = (Σᵢ₌₁ⁿ xᵢyᵢ − n·x̄·ȳ) / (Σᵢ₌₁ⁿ xᵢ² − n·x̄²)
   = s_xy·(n − 1) / (s_x²·(n − 1))
   = s_xy / s_x²

b0 = ȳ − b1·x̄
Model:        Yᵢ = β0 + β1·xᵢ + εᵢ
Fitted value: ŷᵢ = b0 + b1·xᵢ
Residual:     eᵢ = yᵢ − ŷᵢ
Branch                            1     2     3     4     5     6     7     8     9    10     Sum
Student Population (1000s), xᵢ    2     6     8     8    12    16    20    20    22    26     140
Quarterly Sales ($1000s), yᵢ     58   105    88   118   117   137   157   169   149   202    1300
xᵢ·yᵢ                           116   630   704   944  1404  2192  3140  3380  3278  5252   21040
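A minimal Python sketch of the coefficient formulas applied to this data; it reproduces the fitted line used later in these notes (ŷ = 60 + 5x):

    # Least squares coefficients from the formulas above
    x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
    y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
    n = len(x)

    x_bar = sum(x) / n                                               # 14.0
    y_bar = sum(y) / n                                               # 130.0
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * x_bar * y_bar   # 21040 - 18200 = 2840
    sxx = sum(xi ** 2 for xi in x) - n * x_bar ** 2                  # 2528 - 1960 = 568

    b1 = sxy / sxx                                                   # 5.0
    b0 = y_bar - b1 * x_bar                                          # 60.0
    print(b0, b1)                                                    # fitted line: y_hat = 60 + 5x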
SSR = Σᵢ₌₁ⁿ (ŷᵢ − ȳ)²
SST = SSR + SSE
Coefficient of Determination
We often use the coefficient of determination to assess how well the model fits the data.

R² = SSR / SST = 1 − SSE / SST = s_xy² / (s_x²·s_y²) = r²
   = (variation of y explained by the linear relation in x) / (total variation of y)

The estimate of σ² is s_ε² = SSE / (n − 2).
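A short sketch computing SSR, SSE, SST, R², and s_ε² for the example data, assuming the estimates b0 = 60 and b1 = 5 derived above:

    # Sums of squares and R^2 for the example data (fitted line y_hat = 60 + 5x)
    x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
    y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
    b0, b1 = 60.0, 5.0
    n = len(y)
    y_bar = sum(y) / n
    y_hat = [b0 + b1 * xi for xi in x]

    SSR = sum((yh - y_bar) ** 2 for yh in y_hat)              # explained variation: 14200
    SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))     # residual variation: 1530
    SST = SSR + SSE                                           # total variation: 15730
    R2 = SSR / SST                                            # about 0.903
    s_eps2 = SSE / (n - 2)                                    # estimate of sigma^2: 191.25
    print(R2, s_eps2)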
Branch                            1      2      3      4      5      6      7      8      9     10      Sum
Student Population (1000s), xᵢ    2      6      8      8     12     16     20     20     22     26      140
Quarterly Sales ($1000s), yᵢ     58    105     88    118    117    137    157    169    149    202     1300
xᵢ·yᵢ                           116    630    704    944   1404   2192   3140   3380   3278   5252    21040
yᵢ²                            3364  11025   7744  13924  13689  18769  24649  28561  22201  40804   184730
Between-group variation: the variation of the ŷᵢ's, i.e. the sum of (ŷᵢ − ȳ)².
Within-group variation: the sum of (yᵢ − ŷᵢ)².
Test with ANOVA Table
SSR = Σ (ŷᵢ − ȳ)²
SST = Σ (yᵢ − ȳ)²

Test statistic: F = (SSR / 1) / (SSE / (n − 2)) ~ F(1, n − 2) under H0

If F > F_{1, n−2, α}, we reject the null hypothesis.
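A sketch of this F test in Python (assuming scipy is available), using the sums of squares from the worked example later in these notes (SSR = 14200, SSE = 1530, n = 10):

    # F test for H0: beta1 = 0, using the example sums of squares
    from scipy import stats

    n, SSR, SSE = 10, 14200.0, 1530.0
    F = (SSR / 1) / (SSE / (n - 2))          # 74.25
    p_value = stats.f.sf(F, 1, n - 2)        # P(F(1, n-2) > F) under H0
    F_crit = stats.f.ppf(0.95, 1, n - 2)     # critical value at alpha = 0.05
    print(F, p_value, F > F_crit)            # reject H0 if F > F_crit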
Distribution of b1
b1 = Σᵢ₌₁ⁿ (xᵢ − x̄)(Yᵢ − Ȳ) / Σᵢ₌₁ⁿ (xᵢ − x̄)²  ~  N(β1, σ² / Σᵢ₌₁ⁿ (xᵢ − x̄)²)

T = (b1 − β1) / √(s_ε² / ((n − 1)·s_x²))  ~  t(n − 2)
Test with t-statistic
Under H0: β1 = 0,  T = b1 / √(s_ε² / ((n − 1)·s_x²))  ~  t(n − 2)

We reject the null hypothesis if |T| > t_{n−2, α/2} for H1: β1 ≠ 0
(this is equivalent to the ANOVA table test).
We reject the null hypothesis if T > t_{n−2, α} for H1: β1 > 0.
We reject the null hypothesis if T < −t_{n−2, α} for H1: β1 < 0.

We can also test H0: β1 = c with T = (b1 − c) / √(s_ε² / ((n − 1)·s_x²)) ~ t(n − 2).
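A sketch of the two-sided t test in Python (assuming scipy is available), using s_ε², s_x², and b1 from the running example:

    # t test for H0: beta1 = 0 using the example quantities
    import math
    from scipy import stats

    n = 10
    b1 = 5.0
    s_eps2 = 191.25            # SSE / (n - 2)
    s_x2 = 568 / 9             # sample variance of x, about 63.11

    se_b1 = math.sqrt(s_eps2 / ((n - 1) * s_x2))    # standard error of b1
    T = b1 / se_b1                                  # about 8.62
    t_crit = stats.t.ppf(1 - 0.05 / 2, n - 2)       # two-sided critical value
    p_value = 2 * stats.t.sf(abs(T), n - 2)
    print(T, p_value, abs(T) > t_crit)              # reject H0 if |T| > t_crit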
Hypothesis Testing for β1
Branch                            1     2     3     4     5     6     7     8     9    10     Sum
Student Population (1000s), xᵢ    2     6     8     8    12    16    20    20    22    26     140
Quarterly Sales ($1000s), yᵢ     58   105    88   118   117   137   157   169   149   202    1300
ŷᵢ                               70    90   100   100   120   140   160   160   170   190    1300
(ŷᵢ − ȳ)²                      3600  1600   900   900   100   100   900   900  1600  3600   14200
𝑆𝑆𝑅 = 14200
s_ε² = SSE / (n − 2) = 1530 / 8 = 191.25
Test H0: β1 = 0 vs. H1: β1 ≠ 0 with ANOVA:

Source          df        SS        MS        F     P-value
Regression       1     14200     14200    74.25        .000
Error       10−2=8      1530    191.25
Total       10−1=9     15730

Since F > F_{1, 8, 0.05}, we reject H0: the linear relationship between x and y is significant.

R² = SSR / SST = 14200 / 15730 = 0.903
Test H0: β1 = 0 vs. H1: β1 ≠ 0 with the t-statistic:

T = b1 / √(s_ε² / ((n − 1)·s_x²)) = 5 / √(191.25 / (9 × 63.11)) = 8.62
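As a cross-check, scipy.stats.linregress reproduces these numbers directly (a sketch, assuming scipy is available; its slope p-value corresponds to the two-sided t test above):

    # Cross-check of the worked example with scipy
    from scipy import stats

    x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
    y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]

    res = stats.linregress(x, y)
    print(res.slope, res.intercept)       # about 5.0 and 60.0
    print(res.rvalue ** 2)                # about 0.903
    print(res.slope / res.stderr)         # t statistic, about 8.62
    print(res.pvalue)                     # two-sided p-value for H0: beta1 = 0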
Residual
Residual
◼ eᵢ = yᵢ − ŷᵢ
Standardized residual
◼ eᵢ / s_ε
◼ eᵢ / s_eᵢ , where s_eᵢ = s_ε·√(1 − 1/n − (xᵢ − x̄)² / ((n − 1)·s_x²))
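A minimal sketch of these residual calculations for the example data, using the fitted line ŷ = 60 + 5x and the formula for s_eᵢ given above:

    # Residuals and standardized residuals for the example data
    import math

    x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
    y = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
    b0, b1 = 60.0, 5.0
    n = len(x)
    x_bar = sum(x) / n
    s_x2 = sum((xi - x_bar) ** 2 for xi in x) / (n - 1)

    e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]        # raw residuals
    s_eps = math.sqrt(sum(ei ** 2 for ei in e) / (n - 2))    # sqrt(191.25)

    std_res = []
    for xi, ei in zip(x, e):
        s_ei = s_eps * math.sqrt(1 - 1 / n - (xi - x_bar) ** 2 / ((n - 1) * s_x2))
        std_res.append(ei / s_ei)
    print(std_res)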
Regression diagnostics based on residuals
Normality
Homoscedasticity: equal variance assumption
Independence
Outliers
Influential points
Normality
Normal Probability Plot (Q-Q plot)
Heteroscedasticity
Independence
There are many forms of correlation among
your data. Some can be detected from your
residuals while some cannot.
Time-series data are the most commonly discussed type of correlated data, and the correlation can often be detected through a plot of residuals versus time.
Plot of Residuals Versus Time Indicating Autocorrelation
(Alternating)
Plot of Residuals Versus Time Indicating Autocorrelation
(Increasing)
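A quick sketch of such a residual-versus-time plot (assuming matplotlib is available; the residuals from the worked example are reused here, treated as time-ordered purely for illustration):

    # Sketch: plot residuals in time order to look for autocorrelation patterns
    import matplotlib.pyplot as plt

    e = [-12, 15, -12, 18, -3, -3, -3, 9, -21, 12]   # example residuals, in observation order

    plt.plot(range(1, len(e) + 1), e, marker="o")
    plt.axhline(0)
    plt.xlabel("Time (observation order)")
    plt.ylabel("Residual")
    plt.title("Residuals versus time")
    plt.show()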
Outlier
Influential Point
Leverage of Observations
Leverage of observation i:  hᵢ = 1/n + (xᵢ − x̄)² / Σ(xᵢ − x̄)²
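A small sketch computing the leverage hᵢ for the example x values:

    # Leverage of each observation for the example x values
    x = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
    n = len(x)
    x_bar = sum(x) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)        # 568

    h = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    print(h)    # observations far from x_bar have high leverage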
Outliers and Influential Points
An outlier is an observation that is unusually small or
unusually large.
[Scatter plot: an outlier lying far from the other observations]
[Scatter plot: an influential observation; some outliers may be very influential]
Procedure for Regression Diagnostics…
1. Develop a model that has a theoretical basis.
2. Gather data for the two variables in the model. Experimental data is
preferred if possible.
3. Draw the scatter plot to determine whether a linear model appears
to be appropriate. Identify possible outliers.
4. Determine the regression equation.
5. Calculate the residuals and check the required conditions.
6. Assess the model’s fit.
7. If the model fits the data, use the regression equation to predict a
particular value of the dependent variable and/or estimate its mean.
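As a final illustration of step 7, here is a sketch predicting sales for a hypothetical branch near a campus of 10 thousand students, using the fitted line ŷ = 60 + 5x from the example (the value x = 10 is an assumed input, not from the data):

    # Step 7 illustration: point prediction from the fitted line y_hat = 60 + 5x
    b0, b1 = 60.0, 5.0
    x_new = 10                      # hypothetical student population (1000s)
    y_hat_new = b0 + b1 * x_new     # predicted quarterly sales: 110 ($1000s)
    print(y_hat_new)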