Advanced Regression

1
Agenda

1. Bayesian explanation of regression and regularization


2. Classification algorithms for regression
3. Additional techniques to improve quality of regression

2
Probabilistic view on Linear Regression
• The model: $y = Xw + \varepsilon$

• Restrictions:
o Homoscedasticity
o The expectation of the error is zero
o There is no correlation between the errors
• Minimize the loss function to find the best weights

• What is the most common loss function we use?

• Why SSE? Why not, for example, the sum of absolute errors?

o The Gauss-Markov theorem
o The likelihood maximization task

6
Probabilistic view on Linear Regression
Maximum likelihood estimation
• The likelihood function is the joint probability of the observed data viewed
as a function of parameters of the chosen statistical model.
$$L = P(x_1, x_2, x_3, \dots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$$

• Logarithm of the likelihood

$$\log L = \log P(x_1, x_2, \dots, x_n \mid \theta) = \log \prod_{i=1}^{n} p(x_i \mid \theta) = \sum_{i=1}^{n} \log p(x_i \mid \theta)$$
• Log-likelihood has the same maximum as likelihood

• Much more convenient to work with


7
Probabilistic view on Linear Regression
• Let’s make our model probabilistic:
o All errors are i.i.d. and drawn from a normal distribution: $\varepsilon_i \sim \mathcal{N}(0, \sigma^2)$

o Probability of observing $y = y_i$ when the input is $X_i$: $p(y_i \mid X_i, w) = \mathcal{N}(y_i \mid X_i w, \sigma^2)$

8
Probabilistic view on Linear Regression
• Estimate the log-likelihood function:

• What if the errors are not normally distributed, but Laplace-distributed?
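
A sketch of the derivation the slides allude to: with i.i.d. Gaussian errors, maximizing the log-likelihood is equivalent to minimizing the sum of squared errors (SSE); with Laplace errors, it is equivalent to minimizing the sum of absolute errors (SAE).

$$\log L(w) = \sum_{i=1}^{n} \log \frac{1}{\sqrt{2\pi}\,\sigma} \exp\Bigl(-\frac{(y_i - X_i w)^2}{2\sigma^2}\Bigr) = \text{const} - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - X_i w)^2$$

$$\log L(w) = \sum_{i=1}^{n} \log \frac{1}{2b} \exp\Bigl(-\frac{|y_i - X_i w|}{b}\Bigr) = \text{const} - \frac{1}{b} \sum_{i=1}^{n} |y_i - X_i w|$$

In both cases the constant and the positive scale factor do not affect the argmax, so maximizing the likelihood selects the same weights as minimizing SSE or SAE respectively.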

9
Probabilistic view on Regularization
• What if we want to make assumptions about the parameters we optimize?

Maximize not the likelihood $L = p(D \mid \theta)$, but the posterior probability (Bayes' theorem):

$$p(\theta \mid D) = \frac{p(D \mid \theta)\, p(\theta)}{p(D)} \propto p(D \mid \theta)\, p(\theta)$$

12
Probabilistic view on Regularization
• L2 Regularization (normally distributed weights):

13
Probabilistic view on Regularization
• L1 Regularization (Laplace-distributed weights):
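
A sketch of how the weight priors turn into the penalties (assuming i.i.d. Gaussian errors and a zero-mean prior on each weight); maximizing the log-posterior adds the log-prior to the log-likelihood:

$$\log p(w \mid D) = \log p(D \mid w) + \log p(w) + \text{const}$$

$$w_j \sim \mathcal{N}(0, \tau^2): \quad \log p(w) = \text{const} - \frac{1}{2\tau^2} \sum_j w_j^2 \;\Rightarrow\; \min_w \, \text{SSE}(w) + \lambda \lVert w \rVert_2^2 \quad \text{(Ridge)}$$

$$w_j \sim \text{Laplace}(0, b): \quad \log p(w) = \text{const} - \frac{1}{b} \sum_j |w_j| \;\Rightarrow\; \min_w \, \text{SSE}(w) + \lambda \lVert w \rVert_1 \quad \text{(Lasso)}$$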

14
L1 and L2 Regularization differences
• L2 prevents coefficients from becoming too large
• L1 makes the coefficient vector sparser
• Why does L1 zero out coefficients whereas L2 does not?
o Laplace and normal distributions: the Laplace density has a sharp peak at zero, so it concentrates much more prior mass on weights that are exactly zero

16
L1 and L2 Regularization differences
• Why does L1 zero out coefficients whereas L2 does not?

o The common picture for intuitive understanding: the loss contours tend to touch the L1 constraint region (a diamond with corners on the axes) at a corner, where some coefficients are exactly zero, whereas the L2 ball has no corners
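
A minimal sketch of the effect using scikit-learn on synthetic data (the dataset and alpha values are illustrative assumptions): Lasso drives some coefficients exactly to zero, while Ridge only shrinks them.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge, Lasso

# Synthetic data: 20 features, only 5 of which actually matter
X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

ridge = Ridge(alpha=1.0).fit(X, y)   # L2 penalty: shrinks coefficients
lasso = Lasso(alpha=1.0).fit(X, y)   # L1 penalty: zeros some of them out

print("Ridge zero coefficients:", np.sum(ridge.coef_ == 0))  # typically 0
print("Lasso zero coefficients:", np.sum(lasso.coef_ == 0))  # typically > 0
```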

17
Bayesian regression: summing up

Normally distributed errors → Sum of Squared Errors loss function

Laplace-distributed errors → Sum of Absolute Errors loss function

Differently distributed errors → you can derive the loss function by maximizing the likelihood

Normally distributed params → L2 (Ridge) regularization

Laplace-distributed params → L1 (Lasso) regularization

Differently distributed params → you can derive the regularization term by maximizing the posterior probability via Bayes' theorem

18
Other regression algorithms
• Why do we need something different for regression?

o There are a lot of non-linear dependencies to estimate.

o Using different types of models can be very effective.

o The No Free Lunch theorem: no single model is best for every problem.

21
Other regression algorithms
KNN Regression

• Find the k nearest points to the element we want to estimate
• Calculate the average of their target values and take it as our prediction
• Distances can be used as weights in the averaging

✓ Simple to implement and intuitive

✘ All disadvantages of the classification KNN
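
A minimal sketch of the idea using scikit-learn on a toy dataset (the data is an illustrative assumption), showing both uniform and distance-weighted averaging:

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy 1-D data: y = sin(x) plus noise
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)

# Plain averaging of the 5 nearest neighbours' target values
knn_uniform = KNeighborsRegressor(n_neighbors=5, weights="uniform").fit(X, y)
# Closer neighbours get larger weights in the average
knn_distance = KNeighborsRegressor(n_neighbors=5, weights="distance").fit(X, y)

X_new = np.array([[2.5]])
print(knn_uniform.predict(X_new), knn_distance.predict(X_new))
```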

23
Other regression algorithms
Decision tree regressor
• Use variance (or RMSE, or MSE) as the splitting criterion
• Calculate the average of the elements in every leaf; these averages form the set of possible prediction values
• Some pruning criteria have to be adapted (e.g. replacing the share of elements in a node with a variance-to-mean criterion)

✓ Simple and intuitive to interpret

✘ Limited set of possible values to predict
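
A minimal sketch using scikit-learn on synthetic data (dataset and depth are illustrative assumptions); the prediction for any input is the mean target value of the training samples in the leaf it falls into, so the number of distinct predictions is bounded by the number of leaves.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=4, noise=5.0, random_state=0)

# "squared_error" splits by variance reduction (MSE); max_depth limits tree size
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=3,
                             random_state=0).fit(X, y)

# Only as many distinct predicted values as there are leaves
print("distinct predicted values:", np.unique(tree.predict(X)).size)
print("number of leaves:        ", tree.get_n_leaves())
```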

25
Other regression algorithms
Random forest regressor

• Replace voting over the trees' predictions with averaging
• Many more possible prediction values than a single decision tree

✓ All advantages of the RF classifier

✘ All disadvantages of the RF classifier
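
A minimal sketch using scikit-learn on synthetic data (an illustrative assumption), checking that the forest's prediction is exactly the average of its individual trees' predictions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=6, noise=5.0, random_state=0)
rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# Averaging the individual trees' predictions reproduces rf.predict
per_tree = np.stack([tree.predict(X[:5]) for tree in rf.estimators_])
print(np.allclose(per_tree.mean(axis=0), rf.predict(X[:5])))  # True
```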

27
Other regression algorithms
Gradient boosting

• The only important change: different loss functions (for continuous targets) to minimize

28
Other regression algorithms
Gradient boosting

Where is the gradient here?

29
Other regression algorithms
Gradient boosting

• Think about optimization not only in the space of parameters, but in the space of functions
• We move through the function space in the direction of the negative functional gradient
• Note the analogy:

$$\mathit{NewWeakEstimator} \;\rightarrow\; -\frac{\partial\, \mathit{CostFunction}}{\partial f}$$

$$\mathit{Estimator}_{t} = \mathit{Estimator}_{t-1} + \rho \cdot \mathit{NewWeakEstimator}$$
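
A minimal sketch of this idea (not a production implementation; data, depth, and learning rate are illustrative assumptions): for the squared error the negative functional gradient at the training points is just the residual vector, so each new weak learner is fit to the current residuals.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

rho, n_rounds = 0.1, 100           # learning rate and number of boosting rounds
F = np.zeros_like(y, dtype=float)  # current ensemble prediction, Estimator_{t-1}
trees = []

for _ in range(n_rounds):
    residuals = y - F                        # negative gradient of 1/2*(y - F)^2 w.r.t. F
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # NewWeakEstimator
    F += rho * tree.predict(X)               # Estimator_t = Estimator_{t-1} + rho * weak
    trees.append(tree)

print("final training MSE:", np.mean((y - F) ** 2))
```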

30
Other regression algorithms
Gradient boosting

The exact formulas:

31
Other regression algorithms
Gradient boosting

• The negative gradient of the squared error is just a function of the residual:

$$-\frac{\partial}{\partial f} \bigl(y - f(x)\bigr)^{2} = 2\,\bigl(y - f(x)\bigr)$$

✓ Very powerful algorithm, able to capture very complex patterns

✘ Noisy data can lead to overfitting
✘ Problems with parallelization

32
Other regression algorithms
SVM Regression

• An "inverted" classification task
• Find a stripe (the ε-tube) that is as narrow as possible while containing as many points as possible
• Penalize points that fall outside the stripe

✓ All advantages of the SVM classifier

✘ All disadvantages of the SVM classifier
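
A minimal sketch using scikit-learn on synthetic data (dataset and hyperparameter values are illustrative assumptions); `epsilon` sets the width of the tube within which errors are not penalized, and `C` sets the penalty for points outside it:

```python
from sklearn.datasets import make_regression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)

# Feature scaling matters for SVMs; epsilon is the half-width of the penalty-free stripe
model = make_pipeline(StandardScaler(),
                      SVR(kernel="rbf", C=10.0, epsilon=0.5))
model.fit(X, y)
print(model.predict(X[:3]))
```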

34
Other regression algorithms
ElasticNet – a combination of Lasso and Ridge regression (it applies both the L1 and L2 penalties)
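
A minimal sketch using scikit-learn on synthetic data (an illustrative assumption); `l1_ratio` controls the mix between the L1 and L2 penalties:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet

X, y = make_regression(n_samples=200, n_features=20, n_informative=5,
                       noise=10.0, random_state=0)

# l1_ratio = 1.0 is pure Lasso, 0.0 is pure Ridge; 0.5 mixes them equally
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("zeroed coefficients:", sum(c == 0 for c in enet.coef_))
```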

35
Other regression algorithms
MARS (Multivariate Adaptive Regression Splines) - a model built as a combination of several linear regression models fitted to different regions of the data.

36
Other regression algorithms
Huber Regression - a model using a special loss function (the Huber loss), constructed to reduce the influence of outliers in the data.

Other robust options worth considering (a sketch of all three follows below):

• RANSAC
• Theil-Sen
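
A minimal sketch of the robust estimators mentioned here, using scikit-learn on synthetic data with injected outliers (the data and `epsilon` value are illustrative assumptions); in HuberRegressor, `epsilon` controls where the quadratic loss switches to linear:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import HuberRegressor, RANSACRegressor, TheilSenRegressor

X, y = make_regression(n_samples=200, n_features=3, noise=5.0, random_state=0)
y[::20] += 300  # inject some gross outliers

for model in (HuberRegressor(epsilon=1.35),
              RANSACRegressor(random_state=0),
              TheilSenRegressor(random_state=0)):
    model.fit(X, y)
    print(type(model).__name__, "fitted")
```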

Sklearn documentation
37
Hyperparameter tuning
How to tune hyperparameters?

38
Hyperparameter tuning
• Blind choice - just take the default hyperparameters and use them without any tuning. Try to avoid this in real projects!

39
Hyperparameter tuning
• Grid search - iterate over all combinations of the given parameter values and keep the best one.
• Random search - sample parameter combinations from the given distributions and keep the best one found after a fixed number of iterations.
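
A minimal sketch of both approaches using scikit-learn, with a random forest on synthetic data as the model being tuned (the model, data, and search spaces are illustrative assumptions):

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)
rf = RandomForestRegressor(random_state=0)

# Grid search: every combination of the listed values is evaluated
grid = GridSearchCV(rf, {"n_estimators": [100, 300], "max_depth": [3, 6, None]},
                    cv=5).fit(X, y)

# Random search: 10 combinations sampled from the given distributions
rand = RandomizedSearchCV(rf, {"n_estimators": randint(50, 500),
                               "max_depth": randint(2, 10)},
                          n_iter=10, cv=5, random_state=0).fit(X, y)

print(grid.best_params_, rand.best_params_)
```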

40
Hyperparameter tuning
• Advanced optimization (hyperopt, for example) - take into account the history of already evaluated parameter combinations when choosing the next combination to try.
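
A minimal sketch assuming the hyperopt package (the model, data, and search space are illustrative assumptions); the TPE algorithm uses the history of evaluated points to propose the next candidate:

```python
from hyperopt import fmin, hp, tpe
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

X, y = make_regression(n_samples=300, n_features=8, noise=5.0, random_state=0)

def objective(params):
    model = RandomForestRegressor(n_estimators=int(params["n_estimators"]),
                                  max_depth=int(params["max_depth"]),
                                  random_state=0)
    # hyperopt minimizes, so return the negated cross-validation score
    return -cross_val_score(model, X, y, cv=3).mean()

space = {"n_estimators": hp.quniform("n_estimators", 50, 500, 50),
         "max_depth": hp.quniform("max_depth", 2, 10, 1)}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25)
print(best)
```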

More about hyperparameter tuning


41
Thank you!

42
