Advanced Regression


1. Bayesian explanation of regression and regularization

2. Classification algorithms for regression
3. Additional techniques to improve quality of regression

Probabilistic view on Linear Regression
• The model:
$y_i = \theta^T X_i + \varepsilon_i$
• Restrictions:
o Homoscedasticity (every error has the same variance)
o Expectation of error is zero
o Errors are not correlated with each other
• Minimizing the Loss function to find the best weights $\theta$

• What is the most common Loss function we usually use?

• Why the sum of squared errors (SSE)? Why not, for example, the sum of absolute errors?

Probabilistic view on Linear Regression
Maximum likelihood estimation
• The likelihood function is the joint probability of the observed data viewed
as a function of parameters of the chosen statistical model.

$L = P(x_1, x_2, x_3, x_4, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$


• Logarithm of likelihood

$\log L = \log P(x_1, x_2, x_3, x_4, \dots, x_n \mid \theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$
• Log-likelihood has the same maximum as likelihood

• Much more convenient to work with

Probabilistic view on Linear Regression
• Let’s make our model probabilistic:
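o All errors are i.i.d. and drawn from a normal distribution:
$\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$
o Probability density of $y = y_i$ when the input data is $X_i$:
$p(y_i \mid X_i, \theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(y_i - \theta^T X_i)^2}{2\sigma^2}\right)$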

Probabilistic view on Linear Regression
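• Estimate the log-likelihood function (normal errors with variance $\sigma^2$):
$\log L = -\frac{n}{2} \log(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \theta^T X_i)^2$
o Maximizing it over $\theta$ is the same as minimizing the sum of squared errors
• What if the errors follow a Laplace distribution instead of a normal one?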
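As a rough numerical check of how the error distribution determines the loss, the sketch below uses the simplest possible model, a single constant $\theta$ (an assumption made only to keep the example short). The data, the scale parameters `sigma` and `b`, and the search grid are all made up for illustration.

```python
import numpy as np


def gaussian_nll(theta, y, sigma=1.0):
    """Negative log-likelihood of y_i = theta + eps_i with eps_i ~ N(0, sigma^2)."""
    n = y.size
    return 0.5 * n * np.log(2 * np.pi * sigma**2) + np.sum((y - theta) ** 2) / (2 * sigma**2)


def laplace_nll(theta, y, b=1.0):
    """Negative log-likelihood of y_i = theta + eps_i with eps_i ~ Laplace(0, b)."""
    n = y.size
    return n * np.log(2 * b) + np.sum(np.abs(y - theta)) / b


# Made-up observations with one large outlier.
y = np.array([1.0, 2.0, 3.0, 4.0, 100.0])

# Brute-force search over candidate values of theta (step 0.01).
grid = np.linspace(0.0, 100.0, 10001)
theta_gauss = grid[np.argmin([gaussian_nll(t, y) for t in grid])]
theta_laplace = grid[np.argmin([laplace_nll(t, y) for t in grid])]

print("Gaussian MLE:", theta_gauss, "sample mean:", y.mean())
print("Laplace MLE:", theta_laplace, "sample median:", np.median(y))
```

Under normal errors the estimate lands on the sample mean (the minimizer of the sum of squared errors), while under Laplace errors it lands on the sample median (the minimizer of the sum of absolute errors), which is far less sensitive to the outlier.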

Probabilistic view on Regularization
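• What if we want to make assumptions about the parameters we optimize?

Maximize not the likelihood $L = p(D \mid \theta)$, but the posterior probability (Bayes' theorem):

$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)$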
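A minimal sketch of posterior maximization for one specific case: normal errors with standard deviation `sigma` and an independent zero-mean normal prior with standard deviation `tau` on every weight. Both values, the synthetic data, and the variable names are assumptions for illustration. Under these assumptions the negative log-posterior is the sum of squared errors plus an L2 penalty with strength `sigma**2 / tau**2`, so the maximizer has the closed form of ridge regression.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
true_theta = np.array([1.5, -2.0, 0.5])  # made-up "true" weights
sigma, tau = 0.5, 1.0                    # assumed noise std and prior std
y = X @ true_theta + rng.normal(scale=sigma, size=n)


def neg_log_posterior(theta):
    """-log p(theta | D) up to an additive constant that does not depend on theta."""
    data_term = np.sum((y - X @ theta) ** 2) / (2 * sigma**2)   # from the likelihood
    prior_term = np.sum(theta**2) / (2 * tau**2)                # from the normal prior
    return data_term + prior_term


# Multiplying the objective by 2 * sigma^2 gives SSE + lam * ||theta||^2.
lam = sigma**2 / tau**2
theta_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

print("MAP estimate:", theta_map)
```

Shrinking `tau` (a tighter prior) increases `lam` and pulls the estimate towards zero; letting `tau` grow recovers ordinary least squares.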

L1 and L2 Regularization differences
• L2 prevents coefficients from becoming too large.
• L1 makes the coefficient vector sparser.

L1 and L2 Regularization differences
• Why L1 zeros out coefficients whereas L2 does not?

o Common picture for intuitive understanding
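One way to see the difference without a picture is the one-dimensional case, where both penalized problems have closed-form solutions. The sketch below assumes a single coefficient with unregularized value `z` and penalty strength `lam`; the function names and numbers are illustrative only.

```python
import numpy as np


def l1_solution(z, lam):
    """argmin over theta of 0.5 * (theta - z)**2 + lam * |theta| (soft thresholding)."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)


def l2_solution(z, lam):
    """argmin over theta of 0.5 * (theta - z)**2 + 0.5 * lam * theta**2 (rescaling)."""
    return z / (1.0 + lam)


# Made-up unregularized coefficients, some small and some large.
z = np.array([-3.0, -0.4, 0.1, 0.8, 2.5])
lam = 1.0

theta_l1 = l1_solution(z, lam)
theta_l2 = l2_solution(z, lam)

print("L1:", theta_l1)
print("L2:", theta_l2)
print("non-zero with L1:", np.count_nonzero(theta_l1))
print("non-zero with L2:", np.count_nonzero(theta_l2))
```

L1 subtracts a fixed amount from every coefficient and clips at zero, so any coefficient with magnitude below `lam` becomes exactly zero. L2 divides every coefficient by the same factor, so small coefficients get smaller but never reach zero.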

Bayesian Regression summing up:
Normally distributed errors     → Sum of Squared Errors loss function
Laplace distributed errors      → Sum of Absolute Errors loss function
Differently distributed errors  → You can obtain the loss function by maximizing the likelihood

Normally distributed params     → L2 (Ridge) Regularization
Laplace distributed params      → L1 (Lasso) Regularization
Differently distributed params  → You can obtain the regularization term by maximizing the posterior probability using Bayes' theorem
o Using different types of models can be very effective

Other regression algorithms
KNN Regression
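A minimal sketch of the idea, assuming Euclidean distance and an unweighted average of the `k` nearest training targets (other distances and weightings are possible). The helper name `knn_predict` and the tiny dataset are made up for illustration.

```python
import numpy as np


def knn_predict(X_train, y_train, X_query, k=3):
    """Predict each query point as the mean target of its k nearest training points."""
    # Pairwise Euclidean distances, shape (n_query, n_train).
    dists = np.linalg.norm(X_query[:, None, :] - X_train[None, :, :], axis=2)
    # Indices of the k closest training points for every query row.
    nearest = np.argsort(dists, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)


X_train = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
y_train = np.array([0.0, 1.0, 2.0, 3.0, 10.0])

pred = knn_predict(X_train, y_train, np.array([[1.1]]), k=3)
print(pred)
```

For the query 1.1 the three nearest training inputs are 1, 2 and 0, so the prediction is the average of their targets.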

Other regression algorithms
Decision tree regressor
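The building block of a regression tree is a single split: pick the threshold that minimizes the total squared error when each side is predicted by its mean. The sketch below assumes one numeric feature and squared-error splitting; `best_split` and the data are hypothetical names and values for illustration. A full tree would apply the same search recursively to each side.

```python
import numpy as np


def best_split(x, y):
    """Return (threshold, left_mean, right_mean) minimizing the summed squared error."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best = (None, None, None)
    best_sse = np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold fits between two equal feature values
        left, right = y_sorted[:i], y_sorted[i:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        if sse < best_sse:
            best_sse = sse
            threshold = (x_sorted[i - 1] + x_sorted[i]) / 2  # midpoint between neighbours
            best = (threshold, left.mean(), right.mean())
    return best


x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 5.5, 4.5, 20.0, 21.0, 19.0])

threshold, left_mean, right_mean = best_split(x, y)
print(threshold, left_mean, right_mean)
```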

Other regression algorithms
Random forest regressor

Other regression algorithms
Gradient boosting

• The only important change: different losses (for continuous variables) to minimize

Other regression algorithms
Gradient boosting

Accurate formulas:
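A minimal sketch of the additive update commonly used to describe gradient boosting, $F_m(x) = F_{m-1}(x) + \nu\, h_m(x)$, where each $h_m$ is fitted to the negative gradient of the loss. It assumes squared loss (so the negative gradient is simply the residual) and uses scikit-learn's `DecisionTreeRegressor` as the base learner; the learning rate, tree depth, number of rounds and data are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate, n_rounds = 0.1, 100

# F_0: the best constant prediction under squared loss is the mean target.
prediction = np.full_like(y, y.mean())
trees = []
for _ in range(n_rounds):
    # Negative gradient of 0.5 * (y - F)^2 with respect to F is the residual.
    residuals = y - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, residuals)
    # F_m = F_{m-1} + learning_rate * h_m
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

mse_start = np.mean((y - y.mean()) ** 2)
mse_end = np.mean((y - prediction) ** 2)
print(f"training MSE: {mse_start:.4f} -> {mse_end:.4f}")
```

For a different loss only the `residuals` line changes: it becomes the negative gradient of that loss with respect to the current prediction.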

Other regression algorithms
SVM Regression

Other regression algorithms
ElasticNet – combination of Lasso and Ridge Regressions

Can also be considered:

• Theil-Sen
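The regressors listed in this part of the deck are all available in scikit-learn, so they can be compared with the same few lines. The sketch below assumes a synthetic dataset from `make_regression` and arbitrary, untuned hyperparameters; the scores it prints say nothing about how the models rank on real data.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import ElasticNet, TheilSenRegressor
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)

models = {
    "knn": KNeighborsRegressor(n_neighbors=5),
    "tree": DecisionTreeRegressor(max_depth=4, random_state=0),
    "forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "boosting": GradientBoostingRegressor(random_state=0),
    "svr": SVR(kernel="linear", C=10.0),
    "elastic_net": ElasticNet(alpha=0.1, l1_ratio=0.5),  # l1_ratio mixes L1 and L2 penalties
    "theil_sen": TheilSenRegressor(random_state=0),
}

# Mean 5-fold cross-validated R^2 for every model.
scores = {
    name: cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    for name, model in models.items()
}
for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name:12s} mean R^2 = {score:.3f}")
```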

Sklearn documentation
Hyperparameter tuning
How to tune hyperparameters?

Hyperparameter tuning
• Blind choice - take the default hyperparameters and use them without any
tuning. Try to avoid this in real projects!

Hyperparameter tuning
• Grid search - iterate over all possible combinations of the given
parameter values and pick the best one.
• Random search - sample parameter combinations at random from their
distributions for a fixed number of iterations and keep the best one.
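Both strategies are available in scikit-learn as `GridSearchCV` and `RandomizedSearchCV`. The sketch below tunes only the `alpha` of a `Ridge` model; the candidate values, the log-uniform range, the number of iterations and the synthetic data are assumptions for illustration.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=200, n_features=10, noise=15.0, random_state=0)

# Grid search: every listed value is evaluated with 5-fold cross-validation.
grid = GridSearchCV(
    Ridge(),
    param_grid={"alpha": [0.01, 0.1, 1.0, 10.0, 100.0]},
    cv=5,
    scoring="neg_mean_squared_error",
)
grid.fit(X, y)

# Random search: 20 values of alpha are drawn from a log-uniform distribution.
rand = RandomizedSearchCV(
    Ridge(),
    param_distributions={"alpha": loguniform(1e-3, 1e3)},
    n_iter=20,
    cv=5,
    scoring="neg_mean_squared_error",
    random_state=0,
)
rand.fit(X, y)

print("grid search best:", grid.best_params_, grid.best_score_)
print("random search best:", rand.best_params_, rand.best_score_)
```

The cost of grid search grows multiplicatively with every added hyperparameter, while random search keeps a fixed budget (`n_iter`) regardless of how many hyperparameters are searched.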

Hyperparameter tuning
• Advanced optimisation (hyperopt, for example) - use the history of
already evaluated parameter combinations to choose which combination
to evaluate next.
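To illustrate only the idea of using the evaluation history, here is a toy sketch in plain NumPy. It is not the algorithm hyperopt implements: after a random warm-up it simply samples new candidates near the best value found so far. The objective (validation error of closed-form ridge regression), the search range and all names are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 8))
y = X @ rng.normal(size=8) + rng.normal(scale=2.0, size=120)
X_train, y_train, X_val, y_val = X[:80], y[:80], X[80:], y[80:]


def validation_mse(log_alpha):
    """Fit ridge regression in closed form and return the validation MSE."""
    alpha = 10.0 ** log_alpha
    gram = X_train.T @ X_train + alpha * np.eye(X_train.shape[1])
    theta = np.linalg.solve(gram, X_train.T @ y_train)
    return float(np.mean((y_val - X_val @ theta) ** 2))


history = []  # (log_alpha, score) for every evaluated candidate
for step in range(30):
    if step < 10:
        # Warm-up: plain random search over log10(alpha) in [-3, 3].
        candidate = rng.uniform(-3.0, 3.0)
    else:
        # Use the history: sample close to the best candidate seen so far.
        best_so_far = min(history, key=lambda h: h[1])[0]
        candidate = float(np.clip(rng.normal(best_so_far, 0.5), -3.0, 3.0))
    history.append((candidate, validation_mse(candidate)))

best_log_alpha, best_score = min(history, key=lambda h: h[1])
print(f"best alpha = {10.0 ** best_log_alpha:.4g}, validation MSE = {best_score:.4f}")
```

Libraries such as hyperopt replace the "sample near the best" rule with a model of the objective built from the history, but the loop has the same shape: propose, evaluate, record, repeat.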

More about hyperparameter tuning

Thank you!


