MDA Session 7: Simple Linear Regression
Covariance:
Cov(X, Y) = Σ(X − X̄)(Y − Ȳ) / (N − 1)
Correlation
The correlation coefficient between two random variables X and Y, usually denoted by
r(X, Y) or r_XY, is a numerical measure of the linear relationship between them and
is defined as:
r_XY = Cov(X, Y) / (σ_X σ_Y)
Example (cigarette consumption and CHD mortality): the relationship looks strong.
Cov(cig., CHD) = Σ(X − X̄)(Y − Ȳ) / (N − 1) = 222.44 / (21 − 1) = 11.12
r = Cov(cig., CHD) / (s_X s_Y) = 11.12 / ((2.33)(6.69)) = 11.12 / 15.59 = .713
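As a quick illustration, the covariance and correlation formulas above can be evaluated in R. The small x and y vectors below are made up purely for illustration; cov() and cor() use the same N − 1 divisor as the formulas.

> x <- c(2, 4, 6, 8, 10)              # hypothetical data
> y <- c(5, 9, 12, 18, 20)
> cov(x, y)                           # sample covariance, divides by n - 1
> cov(x, y) / (sd(x) * sd(y))         # correlation from the definition above
> cor(x, y)                           # same value, computed directly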
What stories can scatter plot tell?
In 1973, a famous statistician, Francis Anscombe, demonstrated how important it is to visualize the data.
The concept was later extended to create the Datasaurus Dozen.
It is a collection of 12 scatterplots with the same means, standard deviations, and correlation coefficient for
X and Y (up to 2 decimal places).
However, the shapes of the datasets are very different from one another. Therefore, the scatterplots tell very
different stories about the behavior and interrelationships of X and Y.
[Figure: three Y-vs-X panels illustrating simple linear regression, multiple linear regression (with a second predictor Z), and non-linear regression.]
Regression Analysis
Earlier Developments of Regression
The earliest form of regression was the method of least squares, which was published
by Legendre in 1805 and by Gauss in 1809. Legendre and Gauss both applied the
method to the problem of determining, from astronomical observations, the orbits of
bodies about the Sun. Gauss published a further development of the theory of least
squares in 1821, including a version of the Gauss–Markov theorem.
The term "regression" was coined by Francis Galton in the nineteenth century to
describe a biological phenomenon. The phenomenon was that the heights of
descendants of tall ancestors tend to regress down towards a normal average (a
phenomenon also known as regression toward the mean).
Galton’s work was later extended by Udny Yule and Karl Pearson to a more general
statistical context. In the work of Yule and Pearson, the joint distribution of the
response and explanatory variables is assumed to be Gaussian. This assumption was
weakened by R.A. Fisher in his works of 1922 and 1925. Fisher assumed that the
conditional distribution of the response variable is Gaussian, but the joint distribution
need not be.
Galton Board
Scatter plots are used to investigate the possible relationship between the
variables.
Y = α + βx, where the slope β = tan(θ) and θ is the angle the line makes with the x-axis.
Note:
There are an infinite number of lines (and hence values of α and β) that could be drawn through the data.
Regression analysis deals with finding the best relationship between Y and x
(and hence the best fitted values of α and β) and quantifying the strength of that relationship.
Regression Analysis
Given the set {(x_i, y_i), i = 1, 2, …, n} of data involving n pairs of (x, y) values, our objective is to
find the "true" or population regression line Y = α + βx + ε.
Here, ε is a random variable with E(ε) = 0 and Var(ε) = σ². The quantity σ² is often called the
error variance.
Note:
E(ε) = 0 implies that, at a specific x, the y values are distributed around the "true"
regression line Y = α + βx (i.e., the positive and negative errors around the true line
balance out).
𝛼 and 𝛽 are called regression coefficients.
𝛼 and 𝛽 values are to be estimated from the data.
Assumption on the model
The linear regression model for the i-th observation is
Y_i = α + βx_i + ε_i,  i = 1, 2, …, n.
Assumptions:
E(ε_i) = 0 and Var(ε_i) = σ²
Cov(ε_i, ε_j) = 0 for i ≠ j
ε_i ~ iid N(0, σ²)
Assumption on the model
Consequent assumptions on Y:
E(Y_i) = α + βx_i
Var(Y_i) = σ²
The value of 𝜎 determines the extent to which (𝑥, 𝑦) observations deviate from the
regression line.
When 𝜎 is small, most observations will be quite close to the line, but when 𝜎 is
large, there are likely to be some large deviations.
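A minimal R sketch (not from the slides) of what these assumptions say: simulating Y = α + βx + ε with ε ~ N(0, σ²) produces responses that scatter around the line E(Y) = α + βx, with spread governed by σ. The values α = 2, β = 3 and σ = 1.5 are assumptions chosen only for the illustration.

> set.seed(1)
> alpha <- 2; beta <- 3; sigma <- 1.5            # hypothetical parameter values
> x <- seq(1, 10, by = 0.5)
> eps <- rnorm(length(x), mean = 0, sd = sigma)  # errors: iid N(0, sigma^2)
> y <- alpha + beta * x + eps                    # E(Y) = alpha + beta*x, Var(Y) = sigma^2
> plot(x, y); abline(a = alpha, b = beta)        # points scatter around the true line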
Example: Heart Disease and Cigarettes

Country   Cigarettes   CHD
1         11           26
2          9           21
3          9           24
4          9           21
5          8           19
6          8           13
7          8           19
8          6           11
9          6           23
10         5           15
11         5           13
12         5            4
13         5           18
14         5           12
15         5            3
16         4           11
17         4           15
18         4            6
19         3           13
20         3            4
21         3           14

[Figure: scatterplot of CHD mortality per 10,000 (y-axis, 0 to 30) against cigarette consumption per adult per day (x-axis, 2 to 12), with the fitted regression line. For a country that smokes 6 cigarettes per adult per day, we predict a CHD rate of about 14.]
True versus Fitted Regression Line
The task in regression analysis is to estimate the regression coefficients 𝛼 and 𝛽.
Suppose we denote the estimates a for α and b for β. Then the fitted regression line is
Ŷ = a + bx,
where Ŷ is the predicted or fitted value.
[Figure: data points together with the true regression line Y = α + βx and the fitted regression line Ŷ = a + bx.]
Formula:
Ŷ = a + bX
e_i = Y_i − Ŷ_i,  i = 1, 2, 3, …, n

[Figure: the residual e_i is the vertical distance from an observed point to the fitted line Ŷ = a + bx; ε_i is the corresponding deviation from the true line Y = α + βx.]
[Figure: prediction from the fitted regression line on the cigarette/CHD scatterplot.]
We minimize the error sum of squares SSE = Σ_{i=1}^{n} e_i² = Σ_{i=1}^{n} (y_i − a − bx_i)² in order to determine the parameters a and b.
Differentiating SSE with respect to a and b, we have
∂(SSE)/∂a = −2 Σ_{i=1}^{n} (y_i − a − bx_i)
∂(SSE)/∂b = −2 Σ_{i=1}^{n} (y_i − a − bx_i) · x_i
For the minimum value of SSE, we set
∂(SSE)/∂a = 0 and ∂(SSE)/∂b = 0.
Least Squares Method to Estimate α and β
Thus, we set
n·a + b Σ_{i=1}^{n} x_i = Σ_{i=1}^{n} y_i
a Σ_{i=1}^{n} x_i + b Σ_{i=1}^{n} x_i² = Σ_{i=1}^{n} x_i y_i
These two equations can be solved to determine the values of a and b, and it can be
calculated that
b = Σ_{i=1}^{n} (x_i − x̄)(y_i − ȳ) / Σ_{i=1}^{n} (x_i − x̄)² = Cov(X, Y) / σ_X²
a = ȳ − b x̄
Properties of the residuals:
Prop 1: Σ e_i = Σ (y_i − ŷ_i) = 0.
Prop 2: Σ ŷ_i = Σ y_i.
Prop 3: Σ x_i e_i = 0.
Prop 4: Σ ŷ_i e_i = 0.
Linear Regression
Regression equation of Y on X:
Y − ȳ = r_XY (σ_Y / σ_X)(X − x̄)
Regression equation of X on Y:
X − x̄ = r_XY (σ_X / σ_Y)(Y − ȳ)
Regression coefficient of Y on X:
b_YX = r_XY σ_Y / σ_X
Regression coefficient of X on Y:
b_XY = r_XY σ_X / σ_Y
For the Data: Heart Disease and Cigarettes
Cov(X, Y) = 11.12619
σ_X² = 2.334014² = 5.447619
b = 11.12619 / 5.447619 = 2.042395
a = 14.52381 − 2.042395 × 5.952381 = 2.366696
The equation of the least-squares line:
ŷ = 2.367 + 2.042x
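As a check, the same estimates can be reproduced in R from the 21-country table above (a sketch; the variable names cigs and chd are ours, not from the slides):

> cigs <- c(11, 9, 9, 9, 8, 8, 8, 6, 6, 5, 5, 5, 5, 5, 5, 4, 4, 4, 3, 3, 3)
> chd  <- c(26, 21, 24, 21, 19, 13, 19, 11, 23, 15, 13, 4, 18, 12, 3, 11, 15, 6, 13, 4, 14)
> b <- cov(cigs, chd) / var(cigs)   # slope: Cov(X, Y) / s_X^2
> a <- mean(chd) - b * mean(cigs)   # intercept: y-bar minus b times x-bar
> c(a, b)                           # approximately 2.367 and 2.042
> coef(lm(chd ~ cigs))              # lm() returns the same least-squares estimates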
Properties of Regression Coefficient
Property 1: The correlation coefficient is the geometric mean of the two regression
coefficients: b_YX · b_XY = r_XY², so r_XY = ±√(b_YX · b_XY).
a) Construct a scatterplot for these data. How would you describe the relationship
between mean call-to-shock time and survival rate?
b) Find the equation of the least-squares line.
c) Use the least-squares line to predict survival rate for a community with a mean
call-to-shock time of 10 minutes.
Exercises:
Let x be the size of a house (in square feet) and y be the amount of natural gas used
(therms) during a specified period. Suppose that for a particular community, x and y
are related according to the simple linear regression model with
𝛽 = slope of population regression line = 0.017
𝛼 = y intercept of population regression line = −5.0
Houses in this community range in size from 1000 to 3000 square feet.
The regression equation can be used to predict (or possibly manage) output
data from input/process data.
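For instance, under the stated model the predicted gas use can be computed directly (a sketch in R; the house sizes are chosen by us within the stated range):

> alpha <- -5.0; beta <- 0.017    # population regression line given in the exercise
> size <- c(1000, 2000, 3000)     # house sizes within the 1000-3000 sq ft range
> alpha + beta * size             # predicted gas use: 12, 29 and 46 therms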
Predicted Values and Residuals
The predicted or fitted values result from substituting each sample x value in
turn into the equation for the least-squares line. This gives
ŷ_n = nth predicted value = a + bx_n
The residuals from the least-squares line are the n quantities
e_n = y_n − ŷ_n = nth residual
Each residual is the difference between an observed y value and the corresponding
predicted y value.
Plotting the residual: A residual plot is a scatterplot of the (x, residual) pairs.
Isolated points or a pattern of points in the residual plot indicate potential
problems.
Residuals are highly useful for studying whether a given regression model is
appropriate for the data at hand.
Residuals: Heart Disease and Cigarettes
The equation of the least-squares line: ŷ = 2.367 + 2.042x

Obs    x    y     ŷ         e = y − ŷ
1 11 26 24.829 1.1669580
2 9 21 20.745 0.2517483
3 9 24 20.745 3.2517483
⋮ ⋮ ⋮ ⋮ ⋮
7 8 19 18.703 0.2941434
8 6 11 14.619 −3.6210664
9 6 23 14.619 8.3789336
⋮ ⋮ ⋮ ⋮ ⋮
13 5 18 12.577 5.4213287
14 5 12 12.577 −0.5786713
15 5 3 12.577 −9.5786713
⋮ ⋮ ⋮ ⋮ ⋮
18 4 6 10.535 −4.5362762
19 3 13 8.493 4.5061189
20 3 4 8.493 −4.4938811
21 3 14 8.493 5.5061189
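The ŷ and residual columns of this table can be reproduced in R (a sketch, reusing the cigs and chd vectors defined earlier):

> model <- lm(chd ~ cigs)
> fitted(model)        # the y-hat column
> residuals(model)     # the y minus y-hat column; these sum to (essentially) zero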
Residual Plot: Heart Disease and Cigarettes
[Figure: residual plots for the heart disease and cigarettes data.]

Reading residual plots: typical bad patterns, their meaning, and possible actions (the "Good" and "Bad" reference plots are shown on the slides).

1. Residuals vs each X: a curved pattern means the relationship between X and Y is not a straight line but a curve. Try a transformation on X or Y, or both, or use X² in a multiple regression.
2. Residuals vs time order: a pattern over time means that something related to time influences Y. Try to discover it and include it in a multiple regression.
3. Residuals vs predicted Y (fits): used to check that the residuals do not depend on the fitted values. A fan shape means the variation increases as Y gets larger (it is not constant). Try a transformation on Y.
4. Normal probability plot (normal scores vs residuals): used to check that the residuals are Normal. If they are not Normal, try a transformation on X or Y, or both.
Variation
Consider a set of data (x_i, y_i), with Ŷ_i the predicted value of Y at x_i and Ȳ the mean of the y values.

Deviation of the i-th observation from the mean = deviation of the i-th observation from the
predicted value + deviation of the i-th predicted value from the mean:
Y_i − Ȳ = (Y_i − Ŷ_i) + (Ŷ_i − Ȳ)

Total variation in the data = Σ_{i=1}^{n} (Y_i − Ȳ)².
The sum of the squared differences between each predicted y-value and Ȳ = Σ_{i=1}^{n} (Ŷ_i − Ȳ)².
The sum of the squared differences between each observed y-value and the
corresponding predicted y-value = Σ_{i=1}^{n} (Y_i − Ŷ_i)².
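A small R check of this decomposition on the cigarette/CHD model (a sketch, reusing model, cigs and chd from earlier):

> SSTo  <- sum((chd - mean(chd))^2)             # total variation
> SSReg <- sum((fitted(model) - mean(chd))^2)   # variation explained by the fitted line
> SSRes <- sum(residuals(model)^2)              # variation due to error
> all.equal(SSTo, SSReg + SSRes)                # TRUE: total = explained + error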
𝑅2 : Measure of Quality of Fit
A quantity R², called the coefficient of determination, is used to measure the proportion of
variability explained by the fitted model.
We have SSE = Σ_{i=1}^{n} (y_i − ŷ_i)².
It signifies the variability due to error.
Now, let us define the total corrected sum of squares as
SSTo = Σ_{i=1}^{n} (y_i − ȳ)².
Note on regression output: the p-value for the slope, a probability between 0 and 1, indicates
whether the slope is significantly different from zero. If it is less than .05, the slope is
significant (different from zero) and X is linearly related to Y. A significant slope together
with a high R² indicates a good fit.
For the least-squares estimators, Var(b) = σ² / S_xx and Var(a) = σ² (1/n + x̄²/S_xx),
where S_xx = Σ_{i=1}^{n} (x_i − x̄)².
Proof:
Var(b) = Var[ Σ_{i=1}^{n} (x_i − x̄) Y_i / Σ_{i=1}^{n} (x_i − x̄)² ] = Σ_{i=1}^{n} [ (x_i − x̄) / Σ_{j=1}^{n} (x_j − x̄)² ]² Var(Y_i)
= [ Σ_{i=1}^{n} (x_i − x̄)² / ( Σ_{i=1}^{n} (x_i − x̄)² )² ] σ² = σ² / S_xx.
Var(a) = Var(Ȳ − b x̄) = Var(Ȳ) + x̄² Var(b) − 2 x̄ Cov(Ȳ, b) = σ²/n + x̄² σ²/S_xx − 2 x̄ Cov(Ȳ, b).
Now, Cov(Ȳ, b) = Cov( Σ_{i=1}^{n} Y_i / n , Σ_{i=1}^{n} d_i Y_i ), where d_i = (x_i − x̄) / Σ_{j=1}^{n} (x_j − x̄)².
Hence Cov(Ȳ, b) = Σ_{i=1}^{n} (x_i − x̄) Cov(Y_i, Y_i) / [ n Σ_{j=1}^{n} (x_j − x̄)² ] = Σ_{i=1}^{n} (x_i − x̄) Var(Y_i) / [ n Σ_{j=1}^{n} (x_j − x̄)² ]
= σ² ( Σ_{i=1}^{n} x_i − n x̄ ) / [ n Σ_{j=1}^{n} (x_j − x̄)² ] = 0.
Hence, Var(a) = σ² (1/n + x̄²/S_xx).
Estimating σ² and σ
The value of 𝜎 determines the extent to which observed points (𝑥, 𝑦) tend to fall close
to or far away from the population regression line.
The unbiased estimator of σ² is s_e² = SSResid / (n − 2), i.e., E(s_e²) = σ²,
where SSResid = Σ e_i² = Σ (y − ŷ)² = S_YY − b² S_XX.
Residual mean square: MSResid = SSResid / (n − 2).
Also, SSResid / σ² ~ χ²_{n−2}.
Estimating σ² and σ
The coefficient of determination is defined as
R² = 1 − SSResid / SSTo,
where SSTo = Σ (y − ȳ)² = S_yy.
The value of σ represents the magnitude of a typical deviation of a point (x, y) in
the population from the population regression line.
Similarly, 𝑠𝑒 is the magnitude of a typical sample deviation (residual) from the
least-squares line.
The smaller the value of 𝑠𝑒 , the closer the points in the sample fall to the line and
the better the line does in predicting y from x.
Example
A simple linear regression model was used to describe the relationship between y =
hardness of molded plastic and x = amount of time elapsed since the end of the
molding process. Summary quantities included n =15, SSResid = 1235.470, and
SSTo = 25,321.368.
a. Calculate a point estimate of 𝜎. On how many degrees of freedom is the estimate
based?
b. What percentage of observed variation in hardness can be explained by the
simple linear regression model relationship between hardness and elapsed time?
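One way to carry out the arithmetic (a sketch in R; the numbers are the summary quantities given above):

> n <- 15; SSResid <- 1235.470; SSTo <- 25321.368
> sqrt(SSResid / (n - 2))   # point estimate of sigma, about 9.75, based on n - 2 = 13 df
> 1 - SSResid / SSTo        # R-squared, about 0.951, i.e. roughly 95% of the variation explained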
Inferences About the Slope of the Population Regression Line
Properties of the Sampling Distribution of b
1. μ_b = E(b) = β, so the sampling distribution of b is always centered at the value of β,
i.e., b is an unbiased statistic for estimating β.
2. The standard deviation of the statistic b is σ_b = σ / √S_xx.
3. The statistic b is normally distributed.
The normality of the sampling distribution of b implies that the standardized
variable z = (b − β) / σ_b has a standard normal distribution.
However, inferential methods cannot be based on this variable, because the value
of 𝜎𝑏 is not known.
One way to proceed is to estimate 𝜎 with 𝑠𝑒 , yielding an estimated standard
deviation.
Inferences About the Slope of the Population Regression Line
The estimated standard deviation of the statistic b is
s_b = s_e / √S_xx = √[ SSResid / ((n − 2) S_xx) ]
When the four basic assumptions of the simple linear regression model are
satisfied, the probability distribution of the standardized variable
t = (b − β) / s_b
is the t-distribution with df = n − 2.
Confidence interval for β: b ± t_{n−2, α/2} · s_b
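In R this interval can be obtained directly from a fitted model, for example for the cigarette/CHD fit used earlier (a sketch):

> confint(model, level = 0.95)    # 95% CIs for the intercept a and the slope b
> sb <- summary(model)$coefficients["cigs", "Std. Error"]
> coef(model)["cigs"] + c(-1, 1) * qt(0.975, df = 19) * sb   # same interval by hand, df = n - 2 = 19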
Hypothesis Tests Concerning 𝛽
Null Hypothesis : 𝐻0 : 𝛽 = 0
Alternative Hypothesis: 𝐻1 : 𝛽 ≠ 0
Test statistic: t = (b − hypothesized value) / s_b
df: n − 2.
Reject H₀: β = 0 if |t| > t_{n−2, α/2}.
Under 𝐻0 the population regression line is a horizontal line, and the value of
y in the simple linear regression model does not depend on x. That is, 𝑦 =
𝛼 + 𝑒.
In this situation, knowledge of x is of no use in predicting y.
If 𝛽 ≠ 0, there is a useful linear relationship between x and y, and
knowledge of x is useful for predicting y.
The test of 𝐻0 : 𝛽 = 0 versus 𝐻𝑎 : 𝛽 ≠ 0 is called the model utility test for
simple linear regression.
Example
An experiment to study the relationship between x = time spent exercising
(minutes) and y = amount of oxygen consumed during the exercise period
resulted in the following summary statistics.
𝑛 = 20, 𝑥 = 50, 𝑦 = 16,705,
Do these data strongly suggest that there is a negative linear relationship between
temperature and pH? State and test the relevant hypotheses using a significance level of
.01.
Obtain a 95% confidence interval for α + β(40), the mean milk pH when the milk
temperature is 40℃.
Obtain a 95% prediction interval for a single pH observation to be made when milk
temperature = 40℃.
Would you recommend using the data to calculate a 95% confidence interval for the
mean pH when the temperature is 90℃? Why or why not?
SIMPLE LINEAR REGRESSION
USING RStudio
Regression
Correlation helps:
• To check whether two variables are related
• If related, to identify the type and degree of relationship
Regression helps:
• To identify the exact form of the relationship
• To model output in terms of input or process variables
Practical Problem:
Exercise 1: The data from the pulp drying process is given in the file
DC_Simple_Reg.csv. The file contains data on the dry content
achieved at different dryer temperatures. Develop a prediction model
for dry content in terms of dryer temperature.
Path to Solution:
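Step 1, reading the data, is not shown on the slides; a minimal sketch, assuming DC_Simple_Reg.csv has columns named Temp and DContent as the later commands suggest:

> dcdata <- read.csv("DC_Simple_Reg.csv")   # read the pulp drying data
> attach(dcdata)                            # make Temp and DContent directly accessible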
2. Checking Correlation
Scatter Plot
> plot(Temp, DContent)
Path to Solution:
2. Computing Correlation
Correlation between Temp and DContent: 0.9992
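The value above can be computed as follows (a sketch; the slide shows only the output):

> cor(Temp, DContent)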
Remark:
The correlation between y and x needs to be high (preferably 0.8 to 1 or −0.8 to −1.0).
Path to Solution:
3. Performing Regression
> model = lm(DContent ~ Temp)
> summary(model)
Statistic                   Value      Criteria
Residual standard error     0.07059
R-squared                   0.9984     > 0.6
Adjusted R-squared          0.9984     > 0.6

Model           df     F        p value
Regression       1     24497    0.000
Residual        40
Total           41

Criteria:
p value < 0.05
Path to Solution:
3. Performing Regression
Term           Estimate    Std. Error    t value    p value
Temperature    1.293432    0.008264      156.518    0.00

Interpretation
The p value for the independent variable needs to be less than the significance level α
(generally α = 0.05).
4. Regression Anova
> anova(model)
ANOVA
Source SS df MS F p value
Temp 122.057 1 122.057 24497 0.000
Residual 0.199 40 0.005
Total 122.256 41
5: Residual Analysis
Normality Check on residuals
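The slide does not show how Res (and later pred) were created; one plausible way is (a sketch):

> Res  <- residuals(model)   # residuals from the fitted model
> pred <- fitted(model)      # fitted (predicted) dry content values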
> shapiro.test(Res)
5: Residual Analysis
> plot(pred, Res)
> plot(Temp, Res)
Plot the residuals against the fitted values. The points in the graph should be scattered
randomly and should not show any trend or pattern. The residuals should not
depend in any way on the fitted value.
Similarly, the residuals should not depend on x. This can be checked by plotting
residuals vs x. A pattern in this plot is an indication that the residuals are not
independent of x. In that case, instead of x, develop the model with a function of x as
the predictor (e.g., x², 1/x, √x, log(x), etc.).
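For example, a model using x² as the predictor could be fitted as follows (a sketch, not part of the original solution):

> model2 <- lm(DContent ~ I(Temp^2))   # regression of dry content on the square of temperature
> summary(model2)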
Path to Solution:
Residual Analysis