Advanced Regression

1
Agenda

1. Bayesian explanation of regression and regularization


2. Classification algorithms for regression
3. Additional techniques to improve quality of regression

2
Probabilistic view on Linear Regression
• The model: $y = Xw + \varepsilon$

• Restrictions:
o Homoscedasticity
o The expectation of the error is zero
o There is no correlation between the errors
• Minimize the loss function to find the best weights

• What is the most common loss function we use?

• Why SSE? Why not, for example, the sum of absolute errors?

o The Gauss-Markov theorem
o The likelihood maximization task

6
Probabilistic view on Linear Regression
Maximum likelihood estimation
• The likelihood function is the joint probability of the observed data viewed
as a function of parameters of the chosen statistical model.
$$L = P(x_1, x_2, x_3, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

• Logarithm of the likelihood

$$\log L = \log P(x_1, x_2, \dots, x_n \mid \theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
• Log-likelihood has the same maximum as likelihood

• Much more convenient to work with


7
Probabilistic view on Linear Regression
• Let’s make our model probabilistic:
o All errors are i.i.d. and drawn from a normal distribution: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$

o Probability of observing $y = y_i$ when the input is $X_i$: $p(y_i \mid X_i, w) = \mathcal{N}(y_i \mid X_i w, \sigma^2)$

8
Probabilistic view on Linear Regression
• Estimate the log-likelihood function:

• What if the errors are not normally distributed, but Laplace-distributed?
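
A sketch of the derivation the slides allude to: with i.i.d. Gaussian errors, maximizing the log-likelihood is equivalent to minimizing the sum of squared errors (SSE); with Laplace errors, it is equivalent to minimizing the sum of absolute errors (SAE).

$$\log L(w) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Bigl(-\frac{(y_i - X_i w)^2}{2\sigma^2}\Bigr) = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i w)^2$$

$$\log L(w) = \sum_{i=1}^{n} \log \frac{1}{2b} \exp\Bigl(-\frac{|y_i - X_i w|}{b}\Bigr) = \text{const} - \frac{1}{b} \sum_{i=1}^{n} |y_i - X_i w|$$

In both cases the constant and the positive scale factor do not affect the argmax, so maximizing the likelihood selects the same weights as minimizing SSE or SAE respectively.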

9
Probabilistic view on Regularization
• What if we want to make assumptions about the parameters we optimize?

Maximize not the likelihood $L = p(D \mid \theta)$, but the posterior probability (Bayes' theorem):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)$$

12
Probabilistic view on Regularization
• L2 Regularization (normally distributed weights):

13
Probabilistic view on Regularization
• L1 Regularization (Laplace-distributed weights):
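
A sketch of how the weight priors turn into the penalties (assuming i.i.d. Gaussian errors and a zero-mean prior on each weight); maximizing the log-posterior adds the log-prior to the log-likelihood:

$$\log p(w \mid D) = \log p(D \mid w) + \log p(w) + \text{const}$$

$$w_j \sim \mathcal{N}(0, \tau^2): \quad \log p(w) = \text{const} - \frac{1}{2\tau^2} \sum_j w_j^2 \;\Rightarrow\; \min_w \, \text{SSE}(w) + \lambda \lVert w \rVert_2^2 \quad \text{(Ridge)}$$

$$w_j \sim \text{Laplace}(0, b): \quad \log p(w) = \text{const} - \frac{1}{b} \sum_j |w_j| \;\Rightarrow\; \min_w \, \text{SSE}(w) + \lambda \lVert w \rVert_1 \quad \text{(Lasso)}$$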

14
L1 and L2 Regularization differences
• L2 prevents coefficients from becoming too large
• L1 makes the coefficient vector sparser
• Why does L1 zero out coefficients whereas L2 does not?
o Laplace and normal distributions: the Laplace density has a sharp peak at zero, so it concentrates much more prior mass on weights that are exactly zero

16
L1 and L2 Regularization differences
• Why does L1 zero out coefficients whereas L2 does not?

o The common picture for intuitive understanding: the loss contours tend to touch the L1 constraint region (a diamond with corners on the axes) at a corner, where some coefficients are exactly zero, whereas the L2 ball has no corners
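
A minimal sketch of the effect using scikit-learn on synthetic data (the dataset and alpha values are illustrative assumptions): Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 20 features, only 5 of which actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: zeros some of them out

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically > 0
```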

17
Bayesian regression: summing up

Normally distributed errors → Sum of Squared Errors loss function

Laplace-distributed errors → Sum of Absolute Errors loss function

Differently distributed errors → you can derive the loss function by maximizing the likelihood

Normally distributed params → L2 (Ridge) regularization

Laplace-distributed params → L1 (Lasso) regularization

Differently distributed params → you can derive the regularization term by maximizing the posterior probability via Bayes' theorem

18
Other regression algorithms
• Why do we need something different for regression?

o There are a lot of non-linear dependencies to estimate.

o Using different types of models can be very effective.

o The No Free Lunch theorem: no single model is best for every problem.

21
Other regression algorithms
KNN Regression

• Find the k nearest points to the element we want to estimate
• Calculate the average of their target values and take it as our prediction
• Distances can be used as weights in the averaging

✓ Simple to implement and intuitive

✘ All disadvantages of the classification KNN
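
A minimal sketch of the idea using scikit-learn on a toy dataset (the data is an illustrative assumption), showing both uniform and distance-weighted averaging:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data: y = sin(x) plus noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Plain averaging of the 5 nearest neighbours' target values
knn_uniform = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X, y)
# Closer neighbours get larger weights in the average
knn_distance = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)

X_new = np.array([[2.5]])
print(knn_uniform.predict(X_new), knn_distance.predict(X_new))
```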

23
Other regression algorithms
Decision tree regressor
• Use variance (or RMSE, or MSE) as the splitting criterion
• Calculate the average of the elements in every leaf; these averages form the set of possible prediction values
• Some pruning criteria have to be adapted (e.g. replacing the share of elements in a node with a variance-to-mean criterion)

✓ Simple and intuitive to interpret

✘ Limited set of possible values to predict
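
A minimal sketch using scikit-learn on synthetic data (dataset and depth are illustrative assumptions); the prediction for any input is the mean target value of the training samples in the leaf it falls into, so the number of distinct predictions is bounded by the number of leaves.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

# "squared_error" splits by variance reduction (MSE); max_depth limits tree size
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3,
                             random_state=0).fit(X, y)

# Only as many distinct predicted values as there are leaves
print("distinct predicted values:", np.unique(tree.predict(X)).size)
print("number of leaves:        ", tree.get_n_leaves())
```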

25
Other regression algorithms
Random forest regressor

• Replace voting over the trees' predictions with averaging
• Many more possible prediction values than a single decision tree

✓ All advantages of the RF classifier

✘ All disadvantages of the RF classifier
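
A minimal sketch using scikit-learn on synthetic data (an illustrative assumption), checking that the forest's prediction is exactly the average of its individual trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Averaging the individual trees' predictions reproduces rf.predict
per_tree = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.predict(X[:5])))  # True
```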

27
Other regression algorithms
Gradient boosting

• The only important change: different loss functions (for continuous targets) to minimize

28
Other regression algorithms
Gradient boosting

Where is the gradient here?

29
Other regression algorithms
Gradient boosting

• Think about optimization not only in the space of parameters, but in the space of functions
• We move through the function space in the direction of the negative functional gradient
• Note the analogy:

$$\mathit{NewWeakEstimator} \;\rightarrow\; -\frac{\partial\, \mathit{CostFunction}}{\partial f}$$

$$\mathit{Estimator}_{t} = \mathit{Estimator}_{t-1} + \rho \cdot \mathit{NewWeakEstimator}$$
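
A minimal sketch of this idea (not a production implementation; data, depth, and learning rate are illustrative assumptions): for the squared error the negative functional gradient at the training points is just the residual vector, so each new weak learner is fit to the current residuals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

rho, n_rounds = 0.1, 100           # learning rate and number of boosting rounds
F = np.zeros_like(y, dtype=float)  # current ensemble prediction, Estimator_{t-1}
trees = []

for _ in range(n_rounds):
    residuals = y - F                        # negative gradient of 1/2*(y - F)^2 w.r.t. F
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # NewWeakEstimator
    F += rho * tree.predict(X)               # Estimator_t = Estimator_{t-1} + rho * weak
    trees.append(tree)

print("final training MSE:", np.mean((y - F) ** 2))
```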

30
Other regression algorithms
Gradient boosting

The exact formulas:

31
Other regression algorithms
Gradient boosting

• The negative gradient of the squared error is just a function of the residual:

$$-\frac{\partial}{\partial f} \bigl(y - f(x)\bigr)^{2} = 2\,\bigl(y - f(x)\bigr)$$

✓ Very powerful algorithm, able to capture very complex patterns

✘ Noisy data can lead to overfitting
✘ Problems with parallelization

32
Other regression algorithms
SVM Regression

• An "inverted" classification task
• Find a stripe (the ε-tube) that is as narrow as possible while containing as many points as possible
• Penalize points that fall outside the stripe

✓ All advantages of the SVM classifier

✘ All disadvantages of the SVM classifier
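
A minimal sketch using scikit-learn on synthetic data (dataset and hyperparameter values are illustrative assumptions); `epsilon` sets the width of the tube within which errors are not penalized, and `C` sets the penalty for points outside it:

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Feature scaling matters for SVMs; epsilon is the half-width of the penalty-free stripe
model = make_pipeline(StandardScaler(),
                      SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(X, y)
print(model.predict(X[:3]))
```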

34
Other regression algorithms
ElasticNet – a combination of Lasso and Ridge regression (it applies both the L1 and L2 penalties)
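
A minimal sketch using scikit-learn on synthetic data (an illustrative assumption); `l1_ratio` controls the mix between the L1 and L2 penalties:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# l1_ratio = 1.0 is pure Lasso, 0.0 is pure Ridge; 0.5 mixes them equally
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("zeroed coefficients:", sum(c == 0 for c in enet.coef_))
```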

35
Other regression algorithms
MARS (Multivariate Adaptive Regression Splines) - a model built as a combination of several linear regression models fitted to different regions of the data.

36
Other regression algorithms
Huber Regression - a model using a special loss function (the Huber loss), constructed to reduce the influence of outliers in the data.

Other robust options worth considering (a sketch of all three follows below):

• RANSAC
• Theil-Sen
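
A minimal sketch of the robust estimators mentioned here, using scikit-learn on synthetic data with injected outliers (the data and `epsilon` value are illustrative assumptions); in HuberRegressor, `epsilon` controls where the quadratic loss switches to linear:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, RANSACRegressor, TheilSenRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
y[::20] += 300  # inject some gross outliers

for model in (HuberRegressor(epsilon=1.35),
              RANSACRegressor(random_state=0),
              TheilSenRegressor(random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, "fitted")
```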

Sklearn documentation
37
Hyperparameter tuning
How to tune hyperparameters?

38
Hyperparameter tuning
• Blind choice - just take the default hyperparameters and use them without any tuning. Try to avoid this in real projects!

39
Hyperparameter tuning
• Grid search - iterate over all combinations of the given parameter values and keep the best one.
• Random search - sample parameter combinations from the given distributions and keep the best one found after a fixed number of iterations.
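
A minimal sketch of both approaches using scikit-learn, with a random forest on synthetic data as the model being tuned (the model, data, and search spaces are illustrative assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
rf = RandomForestRegressor(random_state=0)

# Grid search: every combination of the listed values is evaluated
grid = GridSearchCV(rf, {"n_estimators": [100, 300], "max_depth": [3, 6, None]},
                    cv=5).fit(X, y)

# Random search: 10 combinations sampled from the given distributions
rand = RandomizedSearchCV(rf, {"n_estimators": randint(50, 500),
                               "max_depth": randint(2, 10)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```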

40
Hyperparameter tuning
• Advanced optimization (hyperopt, for example) - take into account the history of already evaluated parameter combinations when choosing the next combination to try.
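
A minimal sketch assuming the hyperopt package (the model, data, and search space are illustrative assumptions); the TPE algorithm uses the history of evaluated points to propose the next candidate:

```python
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

def objective(params):
    model = RandomForestRegressor(n_estimators=int(params["n_estimators"]),
                                  max_depth=int(params["max_depth"]),
                                  random_state=0)
    # hyperopt minimizes, so return the negated cross-validation score
    return -cross_val_score(model, X, y, cv=3).mean()

space = {"n_estimators": hp.quniform("n_estimators", 50, 500, 50),
         "max_depth": hp.quniform("max_depth", 2, 10, 1)}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25)
print(best)
```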

More about hyperparameter tuning


41
Thank you!

42
