Advanced Regression Presentation
Agenda
Probabilistic view on Linear Regression
• The model (see the sketch below):
• Restrictions:
o Homoscedasticity
o The expectation of the error is zero
o There is no correlation between the errors
• Minimize the loss function to find the best weights
• Logarithm of likelihood:
$\log L(\theta) = \sum_{i=1}^{N} \log p(x_i \mid \theta)$
• The log-likelihood has the same maximum as the likelihood
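A minimal sketch of the model these bullets describe, assuming the standard Gaussian-noise linear setup (the symbols $w$, $x_i$, $\sigma^2$ are the usual conventions, not taken from the slides):
$y_i = w^\top x_i + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \sigma^2)\ \text{i.i.d.}$
$\Rightarrow\; p(y_i \mid x_i, w) = \mathcal{N}(y_i \mid w^\top x_i, \sigma^2)$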
Probabilistic view on Linear Regression
• Estimate the log-likelihood function:
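Under the Gaussian-noise assumption above, the estimate works out as follows (a sketch, not necessarily the slide's exact derivation):
$\log L(w) = \sum_{i=1}^{N} \log \mathcal{N}(y_i \mid w^\top x_i, \sigma^2) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{N} (y_i - w^\top x_i)^2$
Maximizing this over $w$ is therefore equivalent to minimizing the sum of squared errors.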
Probabilistic view on Regularization
• What if we want to make assumptions about the parameters we optimize?
Probabilistic view on Regularization
• L2 Regularization (normally distributed weights):
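A sketch of how a Gaussian prior yields the L2 penalty, assuming each weight has prior $w_j \sim \mathcal{N}(0, \tau^2)$ (the prior scale $\tau$ is a convention here, not from the slides):
$\hat{w}_{\text{MAP}} = \arg\max_w \left[ \log p(y \mid X, w) + \log p(w) \right] = \arg\min_w \left[ \sum_i (y_i - w^\top x_i)^2 + \lambda \sum_j w_j^2 \right], \qquad \lambda = \sigma^2 / \tau^2$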
Probabilistic view on Regularization
• L1 Regularization (Laplace-distributed weights):
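Similarly, assuming a Laplace prior $p(w_j) \propto \exp(-|w_j|/b)$:
$\hat{w}_{\text{MAP}} = \arg\min_w \left[ \sum_i (y_i - w^\top x_i)^2 + \lambda \sum_j |w_j| \right], \qquad \lambda = 2\sigma^2 / b$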
L1 and L2 Regularization differences
• L2 prevents coefficients from becoming too large
• L1 helps make the coefficient vector sparser
• Why does L1 zero out coefficients whereas L2 does not?
o Compare the Laplacian and normal distributions: the Laplace density is sharply peaked at zero, so the L1 penalty keeps a constant pull toward zero, while the L2 pull fades as a coefficient shrinks
Bayesian Regression, summing up:
• Normally distributed errors → sum-of-squared-errors loss function
• Normally distributed weights → L2 regularization
• Laplace-distributed weights → L1 regularization
Other regression algorithms
• Why do we need anything beyond linear regression?
Other regression algorithms
KNN Regression
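A minimal scikit-learn sketch of KNN regression on synthetic data (the dataset and parameters are illustrative, not from the slides):

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))               # one synthetic feature
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)  # noisy target

# The prediction is the (optionally distance-weighted) average of the
# targets of the k nearest training points.
knn = KNeighborsRegressor(n_neighbors=5, weights="distance")
knn.fit(X, y)
print(knn.predict([[3.0]]))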
Other regression algorithms
Decision tree regressor
• Use variance (or RMSE, or MSE) as the splitting criterion
• The prediction value of every leaf is the average of the elements in it
• Some pruning criteria should be changed (e.g., share of elements, variance-to-mean ratio of a node)
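A short sketch with scikit-learn's DecisionTreeRegressor (synthetic data; criterion and depth are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

# criterion="squared_error" picks splits that most reduce the variance
# (MSE) of the target inside the child nodes.
tree = DecisionTreeRegressor(criterion="squared_error", max_depth=4)
tree.fit(X, y)
# Each leaf predicts the mean target of the training samples it contains.
print(tree.predict([[3.0]]))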
Other regression algorithms
Random forest regressor
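A random forest averages many trees, each fit on a bootstrap sample with random feature subsets; a minimal sketch (illustrative parameters):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

# The forest prediction is the average of the individual trees' predictions,
# which lowers variance compared to a single deep tree.
forest = RandomForestRegressor(n_estimators=200, random_state=0)
forest.fit(X, y)
print(forest.predict([[3.0]]))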
Other regression algorithms
Gradient boosting
$\text{NewWeakEstimator} \to -\frac{\partial\,\text{CostFunction}}{\partial f}$
$\text{Estimator}_t = \text{Estimator}_{t-1} + \rho \cdot \text{NewWeakEstimator}$
• Accurate formulas: see the sketch below
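A minimal sketch of the update above for squared loss, where the negative gradient is simply the residual vector (synthetic data; the learning rate and tree depth are illustrative):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

rho = 0.1                      # step size (rho in the update above)
F = np.full_like(y, y.mean())  # initial estimator: a constant prediction
for _ in range(100):
    residuals = y - F          # -dCost/dF for squared loss
    weak = DecisionTreeRegressor(max_depth=2).fit(X, residuals)
    F = F + rho * weak.predict(X)  # Estimator_t = Estimator_{t-1} + rho * new

print(np.mean((y - F) ** 2))   # training MSE after 100 boosting rounds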
Other regression algorithms
SVM Regression
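SVR fits a tube of width epsilon around the data and only penalizes points that fall outside it; a minimal sketch (illustrative parameters):

import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

# Points inside the epsilon-tube contribute no loss; C trades off
# flatness of the function against tolerance for points outside the tube.
svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
svr.fit(X, y)
print(svr.predict([[3.0]]))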
Other regression algorithms
ElasticNet – a combination of Lasso and Ridge regressions
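A minimal sketch (illustrative parameters and synthetic data):

import numpy as np
from sklearn.linear_model import ElasticNet

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 2))
y = 3 * X[:, 0] + rng.normal(0, 0.2, 200)  # second feature is irrelevant

# alpha scales the total penalty; l1_ratio mixes the two penalties
# (l1_ratio=1.0 is pure Lasso, l1_ratio=0.0 is pure Ridge).
enet = ElasticNet(alpha=0.1, l1_ratio=0.5)
enet.fit(X, y)
print(enet.coef_, enet.intercept_)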
Other regression algorithms
MARS (Multivariate Adaptive Regression Splines) - the model is a combination of several linear regression models fitted to different regions of the data.
Other regression algorithms
Huber Regression - a model using a special loss function (the Huber loss). This loss function is constructed to reduce the influence of outliers in the data.
Sklearn documentation
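A minimal sketch with scikit-learn's HuberRegressor (the epsilon value shown is the library default; data and outliers are synthetic):

import numpy as np
from sklearn.linear_model import HuberRegressor

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = 3 * X[:, 0] + rng.normal(0, 0.2, 200)
y[:5] += 50  # inject a few outliers

# Residuals smaller than epsilon are penalized quadratically, larger
# ones only linearly, so the outliers pull the fit far less than in OLS.
huber = HuberRegressor(epsilon=1.35)
huber.fit(X, y)
print(huber.coef_, huber.intercept_)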
Hyperparameter tuning
How to tune hyperparameters?
Hyperparameter tuning
• Blind choice - just take the default hyperparameters and use them without any tuning. Try to avoid this in real projects!
Hyperparameter tuning
• Grid search - iterate over all possible combinations of the given parameter values and keep the best one.
• Random search - sample parameter combinations from their distributions and search for the best one over several iterations.
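A minimal scikit-learn sketch of both strategies (the model, grid, and distributions are illustrative):

import numpy as np
from scipy.stats import uniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

# Grid search: every combination of the listed values, scored by CV.
grid = GridSearchCV(SVR(), {"C": [0.1, 1, 10], "epsilon": [0.01, 0.1, 1]}, cv=5)
grid.fit(X, y)
print(grid.best_params_)

# Random search: n_iter combinations sampled from the distributions.
rand = RandomizedSearchCV(SVR(), {"C": uniform(0.1, 10), "epsilon": uniform(0.01, 1)},
                          n_iter=20, cv=5, random_state=0)
rand.fit(X, y)
print(rand.best_params_)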
Hyperparameter tuning
• Advanced optimisation (hyperopt, for example) - take into account the history of already evaluated parameter combinations when choosing which combination to consider next.
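A minimal hyperopt sketch using its TPE algorithm (the search space and objective are illustrative):

import numpy as np
from hyperopt import fmin, tpe, hp
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = rng.uniform(0, 10, (200, 1))
y = np.sin(X[:, 0]) + rng.normal(0, 0.2, 200)

def objective(params):
    model = SVR(C=params["C"], epsilon=params["epsilon"])
    # hyperopt minimizes the objective, so return the negated CV score.
    return -cross_val_score(model, X, y, cv=5).mean()

space = {"C": hp.loguniform("C", -3, 3),
         "epsilon": hp.loguniform("epsilon", -5, 0)}
# TPE uses the history of evaluated points to propose the next candidate.
best = fmin(objective, space, algo=tpe.suggest, max_evals=50)
print(best)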