Regression Techniques
Simple Linear Regression
y = b0 + b1x + ε
where b0 is the y-intercept, b1 is the slope (b1 = ∆y/∆x), x is the independent variable, ε is the error term, and ŷ = b0 + b1x is the prediction.
For each observation, the variation can be described as:
y = ŷ + ε
Actual = Explained + Error
where y is the observed value, ŷ the prediction, and ε the prediction error.
Least Squares Regression
A least squares regression selects the line with the lowest total sum of
squared prediction errors.
This value is called the Sum of Squares of Error, or SSE.
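The SSE objective can be sketched in a few lines of Python (the data here is illustrative, not from these notes):

```python
# Sum of Squares of Error (SSE) for a candidate line y = b0 + b1*x.
# Illustrative data (assumed, not from the source).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

def sse(b0, b1, xs, ys):
    """Total of squared prediction errors for the line y = b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Least squares selects the (b0, b1) pair that minimises this quantity.
print(sse(0.0, 2.0, xs, ys))
```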
The total variation of the dependent variable is measured around its mean ȳ. The coefficient of determination is:
R² = SSR / SST = SSR / (SSR + SSE)
The value of R² ranges between 0 and 1, and the higher its value, the more of the variation the regression model explains. It is often expressed as a percentage.
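A minimal sketch of this decomposition, assuming a least squares fit on made-up data (so SSR + SSE = SST holds exactly):

```python
# R^2 from the sums of squares: R^2 = SSR / SST = SSR / (SSR + SSE).
# Illustrative data (assumed, not from the source).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

# Least squares fit, so the decomposition SSR + SSE = SST holds exactly.
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
preds = [b0 + b1 * x for x in xs]

sse = sum((y - p) ** 2 for y, p in zip(ys, preds))   # unexplained variation
ssr = sum((p - y_bar) ** 2 for p in preds)           # explained variation
sst = sum((y - y_bar) ** 2 for y in ys)              # total variation

r_squared = ssr / sst
print(round(r_squared, 4))
```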
Standard Error
The Standard Error of a regression is a measure of its variability. It
can be used in a similar manner to standard deviation, allowing for
prediction intervals.
ŷ ± 2 standard errors gives an approximately 95% prediction interval, and ± 3 standard errors gives approximately 99%.
The Standard Error is calculated by taking the square root of the average squared prediction error:
standard error = √( SSE / (n − k) )
where n is the number of observations in the sample and k is the total number of estimated parameters in the model.
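As a sketch, with illustrative values for SSE, n, and k (assumed, not from the notes):

```python
import math

# Standard error of the regression: sqrt(SSE / (n - k)).
# Illustrative values (assumed): SSE from a fit with n observations and
# k estimated parameters (intercept + slope for a simple regression).
sse = 0.082
n, k = 4, 2
standard_error = math.sqrt(sse / (n - k))

# A rough 95% prediction interval is then y_hat +/- 2 * standard_error.
print(round(standard_error, 4))
```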
The output of a simple regression is the coefficient β and the constant
A.
The equation is then:
y = A + βx + ε
where ε is the residual error.
β is the per unit change in the dependent variable for each unit change
in the independent variable.
Mathematically:
β = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)²
A = ȳ − β x̄
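These two formulas translate directly into code; the data here is illustrative, not from the notes:

```python
# Closed-form least squares estimates:
#   beta = sum((x - x_bar)(y - y_bar)) / sum((x - x_bar)^2)
#   A    = y_bar - beta * x_bar
# Illustrative data (assumed, not from the source).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 7.8]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
beta = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs))
A = y_bar - beta * x_bar
print(round(beta, 2), round(A, 2))   # 1.94 0.15
```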
Multiple Linear Regression
More than one independent variable can be used to explain variance in
the dependent variable, as long as they are not linearly related.
A multiple regression takes the form:
y = A + β1X1 + β2X2 + ⋯ + βkXk + ε
where k is the number of independent variables (parameters).
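A sketch of a two-variable fit using NumPy's least squares solver; the data is generated from assumed coefficients, not taken from the notes:

```python
import numpy as np

# Multiple linear regression y = A + b1*X1 + b2*X2 + e via least squares.
# Illustrative data generated from known coefficients (assumed).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                   # two independent variables
y = 1.0 + 2.0 * X[:, 0] - 3.0 * X[:, 1]        # A=1, b1=2, b2=-3, no noise

# Prepend a column of ones so the first coefficient is the constant A.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(coef, 3))                       # approximately [1., 2., -3.]
```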
Polynomial Regression
It is a technique to fit a nonlinear equation by taking polynomial functions of the independent variable.
A polynomial of degree k in one variable is written as:
y = β0 + β1X + β2X² + ⋯ + βkX^k + ε
Hence, in situations where the relationship between the dependent and independent variable appears to be non-linear, we can deploy polynomial regression models.
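A sketch using NumPy's `polyfit`, with data generated from an assumed quadratic (not from the notes):

```python
import numpy as np

# Polynomial regression: fit y = b0 + b1*x + b2*x^2 by least squares.
# Illustrative data from a known quadratic (assumed).
x = np.linspace(-3, 3, 30)
y = 0.5 + 1.0 * x + 2.0 * x ** 2       # b0=0.5, b1=1, b2=2, no noise

# polyfit returns coefficients highest degree first: [b2, b1, b0].
b2, b1, b0 = np.polyfit(x, y, deg=2)
print(round(b0, 3), round(b1, 3), round(b2, 3))
```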
Logistic Regression
Logistic regression is used to find the probability of event = Success versus event = Failure.
We should use logistic regression when the dependent variable is
binary (0/ 1, True/ False, Yes/ No) in nature.
The logistic regression equation is given by
y = log( p / (1 − p) )
where p is the probability of the event occurring.
It is widely used for classification problems.
Logistic regression does not require a linear relationship between the dependent and independent variables.
It can handle various types of relationships because it applies a non-linear log transformation to the predicted odds ratio.
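The log-odds transformation and its inverse (the sigmoid) can be sketched as follows; the coefficient values are hypothetical:

```python
import math

# The logit link and its inverse (the logistic / sigmoid function).
# A minimal sketch of the transformation, not a fitted model.
def logit(p):
    """Log-odds of probability p."""
    return math.log(p / (1 - p))

def sigmoid(z):
    """Inverse of logit: maps any real z back to a probability in (0, 1)."""
    return 1 / (1 + math.exp(-z))

# A linear predictor z = A + beta*x is mapped to a probability:
A, beta = -4.0, 0.8          # hypothetical coefficients
x = 5.0
p = sigmoid(A + beta * x)    # probability of "success" at x
print(round(p, 3))           # 0.5 here, since A + beta*x = 0
```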
Stepwise Regression
This form of regression is used when we deal with multiple independent
variables. In this technique, the selection of independent variables
is done with the help of an automatic process, which involves no human
intervention.
Standard stepwise regression does two things: it adds and removes predictors as needed at each step.
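A minimal forward-selection sketch of the "adding" half of this process (the removal step is omitted); the data and number of steps are assumed:

```python
import numpy as np

# Forward stepwise sketch: starting from no predictors, repeatedly add
# the candidate column that most reduces SSE.
# Illustrative data (assumed): only columns 1 and 3 actually matter.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 4))                 # four candidate predictors
y = 3.0 * X[:, 1] - 2.0 * X[:, 3]

def sse_with(cols):
    """SSE of a least squares fit using the given predictor columns."""
    design = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    return float(resid @ resid)

selected = []
for _ in range(2):                           # add two predictors
    best = min((c for c in range(4) if c not in selected),
               key=lambda c: sse_with(selected + [c]))
    selected.append(best)
print(sorted(selected))                      # the two informative columns
```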
Ridge Regression
Ridge Regression is a technique used when the data suffers from multicollinearity (independent variables are highly correlated).
In multicollinearity, even though the least squares estimates are unbiased, their variances are large, which can push the estimates far from the true values. By adding a degree of bias to the regression estimates, ridge regression reduces these standard errors.
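Ridge's closed-form estimate β = (XᵀX + λI)⁻¹Xᵀy can be sketched with NumPy on deliberately collinear data (all values assumed):

```python
import numpy as np

# Ridge regression closed form: beta = (X'X + lambda*I)^-1 X'y.
# Illustrative data with two nearly collinear predictors (assumed).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # almost identical to x1
X = np.column_stack([x1, x2])
y = x1 + x2 + rng.normal(scale=0.1, size=100)

lam = 1.0                                    # penalty strength (tuning choice)
beta = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(np.round(beta, 2))
```

Plain least squares would split the weight between the two near-identical columns almost arbitrarily; the penalty stabilises the estimate so both coefficients come out roughly equal.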
Lasso Regression
Similar to Ridge Regression, Lasso (Least Absolute Shrinkage and Selection Operator) penalizes the size of the regression coefficients, but it penalizes their absolute values rather than their squares, which can shrink some coefficients exactly to zero and so performs variable selection.
In addition, it is capable of reducing the variability and improving the accuracy of linear regression models.
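In the simplest case (a single standardized predictor), the lasso estimate is the least squares estimate passed through a "soft threshold"; a sketch of that operator, with hypothetical inputs:

```python
import math

# Lasso shrinkage of an OLS coefficient in the orthonormal special case:
# the coefficient is shrunk toward zero by lambda, and set exactly to
# zero when it is small enough. A minimal sketch, not a full solver.
def soft_threshold(b_ols, lam):
    if abs(b_ols) <= lam:
        return 0.0                              # variable dropped entirely
    return math.copysign(abs(b_ols) - lam, b_ols)

print(soft_threshold(2.5, 1.0))   # 1.5: shrunk toward zero
print(soft_threshold(0.4, 1.0))   # 0.0: eliminated (variable selection)
```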
Elastic Net Regression
Elastic Net is a hybrid of the Lasso and Ridge Regression techniques.
Elastic Net is useful when there are multiple features which are correlated: Lasso is likely to pick one of them at random, while Elastic Net is likely to pick both.
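The blended penalty can be sketched directly; the `alpha` mixing parameter and coefficient values are assumed notation for illustration:

```python
# Elastic net penalty: a blend of the lasso (absolute-value) and ridge
# (squared) penalties. alpha=1 recovers the lasso, alpha=0 the ridge.
def elastic_net_penalty(coefs, lam, alpha):
    l1 = sum(abs(b) for b in coefs)      # lasso part
    l2 = sum(b * b for b in coefs)       # ridge part
    return lam * (alpha * l1 + (1 - alpha) * l2)

print(elastic_net_penalty([1.0, -2.0], lam=0.5, alpha=0.5))   # 2.0
```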
Regression Analysis Tools
Excel, Minitab, SPSS
NCSS
MacAnova
RegressIt
EViews
MATLAB
JMP
Stata
Example: Use the data in Table to obtain the regression line relating
income and alcohol consumption.
Province Income Alcohol
Newfoundland 26.8 8.7
Prince Edward Island 27.1 8.4
Nova Scotia 29.5 8.8
New Brunswick 28.4 7.6
Quebec 30.8 8.9
Ontario 36.4 10.0
Manitoba 30.4 9.7
Saskatchewan 29.8 8.9
Alberta 35.1 11.1
British Columbia 32.5 10.9
The intercept is obtained from a = ȳ − b x̄:
a = 9.30 − (0.276 × 30.68)
  = 9.30 − 8.468 = 0.832
The least squares regression line is
Ŷ = 0.832 + 0.276X
While the intercept a = 0.832 has little real meaning, the slope of the
line can be interpreted meaningfully. The slope b = 0.276 is positive,
indicating that as income increases, alcohol consumption also increases,
as illustrated in the graph below.
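The worked example can be checked numerically with the table's data:

```python
# Checking the worked example: least squares line for income vs. alcohol.
income  = [26.8, 27.1, 29.5, 28.4, 30.8, 36.4, 30.4, 29.8, 35.1, 32.5]
alcohol = [ 8.7,  8.4,  8.8,  7.6,  8.9, 10.0,  9.7,  8.9, 11.1, 10.9]

x_bar = sum(income) / len(income)      # mean income, 30.68
y_bar = sum(alcohol) / len(alcohol)    # mean consumption, 9.30
b = (sum((x - x_bar) * (y - y_bar) for x, y in zip(income, alcohol))
     / sum((x - x_bar) ** 2 for x in income))
a = y_bar - b * x_bar
print(round(b, 3), round(a, 3))
```

The slope comes out to 0.276 as in the example; the unrounded intercept is about 0.835 (the example's 0.832 comes from rounding the slope to 0.276 before computing a).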
Thanks