Week - 2

Day - 1
Machine Learning 2/3
Sabyasachi Parida
Agenda
● What is a Decision Tree?
  ○ How Decision Trees work: splitting criteria (Gini, Entropy)
  ○ Information Gain
  ○ Pruning methods to prevent overfitting
● Bagging vs Boosting
  ○ Introduction to ensemble methods
  ○ Bagging (Bootstrap Aggregating): definition, advantages, and use cases
  ○ Boosting: definition, advantages, and use cases
● Random Forest
  a. What is a Random Forest?
  b. How Random Forest works: aggregation of Decision Trees
  c. Advantages of Random Forest over a single Decision Tree
● Gradient Boosted Trees
  a. Introduction to Gradient Boosting
  b. How Gradient Boosted Trees work
  c. Comparison with other boosting methods
● Handling Class Imbalance
  ○ Understanding class imbalance and its impact on model performance
  ○ Techniques to handle class imbalance:
    a. Resampling methods (Oversampling, Undersampling)
    b. Synthetic methods (SMOTE)
● Hyper-Parameter Tuning
  ○ Importance of hyper-parameter tuning in machine learning
  ○ Methods of hyper-parameter tuning:
    a. Grid Search
    b. Random Search
    c. Bayesian Optimization
What is a Decision Tree?

Decision Tree

A decision tree is a type of supervised learning algorithm used for both classification and
regression tasks.
It splits the data into subsets based on the value of input features, resulting in a tree-like model of
decisions.

[Figure: example decision tree, labelling the root node, branches, internal nodes, and leaf nodes.]
How Decision Trees Work:
Splitting Criteria: The choice of which attribute to split on at each node in the tree is determined by a
splitting criterion.
Common criteria include Gini Index and Entropy.

Gini Index/Impurity
What is the Gini Index?

● Definition: The Gini Index, also known as Gini Impurity (sometimes loosely called the Gini Coefficient, though that term more commonly refers to a different measure in economics), is a metric used to evaluate the "purity" of a dataset. It is commonly used in decision trees to decide the best feature to split the data at each node.

● Range: The Gini Index ranges from 0 up to a maximum of 1 - 1/n for n classes (so at most 0.5 for a binary problem).
  ○ 0: Indicates perfect purity, meaning all elements belong to a single class.
  ○ Maximum value: Indicates maximum impurity, where elements are evenly distributed across the classes.
The Gini Index for a node with n classes is calculated as:

Gini = 1 - (p1² + p2² + ... + pn²) = 1 - Σ pi²

where pi is the probability of an element being classified into class i.

Example
Suppose a node has 10 instances, with 4 instances of class A and 6 instances of class B.

[A,A,A,A,B,B,B,B,B,B]

Calculate the probabilities:

P(A) = 4/10 = 0.4


P(B) = 6/10 = 0.6

Compute the Gini Index:

Gini Index = 1 - (P(A)² + P(B)²)
Gini Index = 1 - ((0.4)² + (0.6)²) = 1 - (0.16 + 0.36) = 0.48

The Gini Index of 0.48 indicates a mix of classes at this node.
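As a quick sanity check, the same number can be reproduced in a few lines of Python (a minimal sketch; the label list is the toy node from the example above):

from collections import Counter

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities.
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())

node = ["A"] * 4 + ["B"] * 6
print(gini(node))  # ≈ 0.48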
What is Entropy?

● Definition: Entropy is a measure of the amount of uncertainty or randomness in a dataset. In the context of decision trees,
it quantifies the impurity or disorder in a set of examples.
● Range: Entropy ranges from 0 to log2(n) where n is the number of classes.
○ 0: Indicates that the dataset is perfectly pure (all instances belong to a single class).
○ Maximum Entropy: Indicates complete randomness, where instances are evenly distributed across classes.

Entropy for a node with n classes is calculated as:

Entropy = -(p1 log2 p1 + p2 log2 p2 + ... + pn log2 pn) = -Σ pi log2(pi)

where pi is the probability of an element being classified into class i.

Example
Using the same node as before (4 instances of class A, 6 of class B):

Entropy = -(P(A) log2 P(A) + P(B) log2 P(B))
Entropy = -(0.4 log2 0.4 + 0.6 log2 0.6) ≈ 0.971

The entropy of about 0.971 indicates a mix of classes at this node, reflecting the uncertainty or impurity.
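The corresponding entropy calculation in Python (again a minimal sketch over the same toy node):

import math
from collections import Counter

def entropy(labels):
    # Shannon entropy in bits: -sum(p * log2(p)) over the classes present.
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

node = ["A"] * 4 + ["B"] * 6
print(round(entropy(node), 3))  # ≈ 0.971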

[Dataset slide: a 14-instance table with Feature 1 (Weather: Sunny, Overcast, Rainy), Feature 2 (Temperature: Hot, Mild, Cool), and a Yes/No target.]
Calculating Gini Index and Entropy for Feature Selection

We'll calculate the Gini Index and Entropy for two features: "Feature 1" (Weather) and "Feature 2" (Temperature).

1. Gini Index for "Feature 1" (Sunny, Overcast, Rainy)
● Sunny:
  ○ Total: 5 (3 No, 2 Yes)
  ○ Gini(Sunny) = 1 - (P(No)^2 + P(Yes)^2)
  ○ P(No) = 3/5, P(Yes) = 2/5
  ○ Gini(Sunny) = 1 - ((3/5)^2 + (2/5)^2) = 1 - (9/25 + 4/25) = 1 - 13/25 = 12/25 = 0.48
● Overcast:
  ○ Total: 4 (0 No, 4 Yes)
  ○ Gini(Overcast) = 1 - (P(No)^2 + P(Yes)^2)
  ○ P(No) = 0/4, P(Yes) = 4/4
  ○ Gini(Overcast) = 1 - (0 + 1) = 0
● Rainy:
  ○ Total: 5 (2 No, 3 Yes)
  ○ Gini(Rainy) = 1 - (P(No)^2 + P(Yes)^2)
  ○ P(No) = 2/5, P(Yes) = 3/5
  ○ Gini(Rainy) = 1 - ((2/5)^2 + (3/5)^2) = 1 - (4/25 + 9/25) = 1 - 13/25 = 12/25 = 0.48
● Weighted Gini for "Feature 1":
  ○ Gini(Feature 1) = (5/14) * Gini(Sunny) + (4/14) * Gini(Overcast) + (5/14) * Gini(Rainy)
  ○ Gini(Feature 1) = (5/14) * 0.48 + (4/14) * 0 + (5/14) * 0.48
  ○ Gini(Feature 1) ≈ 0.171 + 0 + 0.171 ≈ 0.343

2. Gini Index for "Feature 2" (Hot, Mild, Cool)
● Hot:
  ○ Total: 4 (2 No, 2 Yes)
  ○ Gini(Hot) = 1 - (P(No)^2 + P(Yes)^2)
  ○ P(No) = 2/4, P(Yes) = 2/4
  ○ Gini(Hot) = 1 - ((2/4)^2 + (2/4)^2) = 1 - (4/16 + 4/16) = 1 - 8/16 = 1 - 0.5 = 0.5
● Mild:
  ○ Total: 6 (2 No, 4 Yes)
  ○ Gini(Mild) = 1 - (P(No)^2 + P(Yes)^2)
  ○ P(No) = 2/6, P(Yes) = 4/6
  ○ Gini(Mild) = 1 - ((2/6)^2 + (4/6)^2) = 1 - (4/36 + 16/36) = 1 - 20/36 = 1 - 5/9 ≈ 0.444
● Cool:
  ○ Total: 4 (1 No, 3 Yes)
  ○ Gini(Cool) = 1 - (P(No)^2 + P(Yes)^2)
  ○ P(No) = 1/4, P(Yes) = 3/4
  ○ Gini(Cool) = 1 - ((1/4)^2 + (3/4)^2) = 1 - (1/16 + 9/16) = 1 - 10/16 = 1 - 0.625 = 0.375
● Weighted Gini for "Feature 2":
  ○ Gini(Feature 2) = (4/14) * Gini(Hot) + (6/14) * Gini(Mild) + (4/14) * Gini(Cool)
  ○ Gini(Feature 2) = (4/14) * 0.5 + (6/14) * 0.444 + (4/14) * 0.375
  ○ Gini(Feature 2) ≈ 0.143 + 0.190 + 0.107 ≈ 0.440
Calculating Gini Index and Entropy for Feature Selection

1. Entropy for "Feature 1" (Sunny, Overcast, Rainy)
● Sunny:
  ○ Total: 5 (3 No, 2 Yes)
  ○ Entropy(Sunny) = - (P(No) * log2(P(No)) + P(Yes) * log2(P(Yes)))
  ○ P(No) = 3/5, P(Yes) = 2/5
  ○ Entropy(Sunny) = - (3/5 * log2(3/5) + 2/5 * log2(2/5)) ≈ - (0.6 * -0.737 + 0.4 * -1.322) ≈ 0.971
● Overcast:
  ○ Total: 4 (0 No, 4 Yes)
  ○ Entropy(Overcast) = - (P(No) * log2(P(No)) + P(Yes) * log2(P(Yes)))
  ○ P(No) = 0/4, P(Yes) = 4/4
  ○ Entropy(Overcast) = 0 (since log2(1) = 0 and the zero-probability term contributes nothing)
● Rainy:
  ○ Total: 5 (2 No, 3 Yes)
  ○ Entropy(Rainy) = - (P(No) * log2(P(No)) + P(Yes) * log2(P(Yes)))
  ○ P(No) = 2/5, P(Yes) = 3/5
  ○ Entropy(Rainy) = - (2/5 * log2(2/5) + 3/5 * log2(3/5)) ≈ - (0.4 * -1.322 + 0.6 * -0.737) ≈ 0.971
● Weighted Entropy for "Feature 1":
  ○ Entropy(Feature 1) = (5/14) * Entropy(Sunny) + (4/14) * Entropy(Overcast) + (5/14) * Entropy(Rainy)
  ○ Entropy(Feature 1) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971
  ○ Entropy(Feature 1) ≈ 0.347 + 0 + 0.347 ≈ 0.694

2. Entropy for "Feature 2" (Hot, Mild, Cool)
● Hot:
  ○ Total: 4 (2 No, 2 Yes)
  ○ Entropy(Hot) = - (P(No) * log2(P(No)) + P(Yes) * log2(P(Yes)))
  ○ P(No) = 2/4, P(Yes) = 2/4
  ○ Entropy(Hot) = - (2/4 * log2(2/4) + 2/4 * log2(2/4)) = - (0.5 * -1 + 0.5 * -1) = 1
● Mild:
  ○ Total: 6 (2 No, 4 Yes)
  ○ Entropy(Mild) = - (P(No) * log2(P(No)) + P(Yes) * log2(P(Yes)))
  ○ P(No) = 2/6, P(Yes) = 4/6
  ○ Entropy(Mild) = - (2/6 * log2(2/6) + 4/6 * log2(4/6)) ≈ - (0.333 * -1.585 + 0.667 * -0.585) ≈ 0.918
● Cool:
  ○ Total: 4 (1 No, 3 Yes)
  ○ Entropy(Cool) = - (P(No) * log2(P(No)) + P(Yes) * log2(P(Yes)))
  ○ P(No) = 1/4, P(Yes) = 3/4
  ○ Entropy(Cool) = - (1/4 * log2(1/4) + 3/4 * log2(3/4)) ≈ - (0.25 * -2 + 0.75 * -0.415) ≈ 0.811
● Weighted Entropy for "Feature 2":
  ○ Entropy(Feature 2) = (4/14) * Entropy(Hot) + (6/14) * Entropy(Mild) + (4/14) * Entropy(Cool)
  ○ Entropy(Feature 2) = (4/14) * 1 + (6/14) * 0.918 + (4/14) * 0.811
  ○ Entropy(Feature 2) ≈ 0.286 + 0.393 + 0.232 ≈ 0.911
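These weighted impurities are easy to verify programmatically. A minimal sketch, using the per-value (No, Yes) counts listed above:

import math

def gini(counts):
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def entropy(counts):
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted(groups, impurity):
    # Weight each feature value's impurity by its share of all instances.
    n = sum(sum(g) for g in groups)
    return sum((sum(g) / n) * impurity(g) for g in groups)

# (No, Yes) counts per feature value
feature1 = [(3, 2), (0, 4), (2, 3)]   # Sunny, Overcast, Rainy
feature2 = [(2, 2), (2, 4), (1, 3)]   # Hot, Mild, Cool

for name, groups in [("Feature 1", feature1), ("Feature 2", feature2)]:
    print(name, round(weighted(groups, gini), 3), round(weighted(groups, entropy), 3))
# Feature 1: ≈ 0.343 (Gini), ≈ 0.694 (Entropy); Feature 2: ≈ 0.440, ≈ 0.911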
Summary

● Feature 1 (Weather):
  ○ Gini Index: 0.343
  ○ Entropy: 0.694
● Feature 2 (Temperature):
  ○ Gini Index: 0.440
  ○ Entropy: 0.911

Based on both the Gini Index and Entropy calculations, "Feature 1" (Weather) is the better choice for the first split in the decision tree: it has lower impurity measures, indicating that it provides a clearer separation of the classes than "Feature 2" (Temperature).
Information Gain

Information Gain is a concept used in decision trees to determine the most informative feature to split on at each
node.

It quantifies the effectiveness of a particular feature in reducing uncertainty or entropy in the dataset after the split.

Interpretation

● Higher Information Gain indicates that a feature provides more useful information for classifying the data.
● Features with higher Information Gain are preferred as they lead to greater reduction in entropy and thus better decision trees.
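For completeness, the formula and a worked check using the weighted entropies computed above (the overall label split of 9 Yes / 5 No follows from the per-value counts):

Information Gain(S, A) = Entropy(S) - Σ (|Sv| / |S|) * Entropy(Sv)

where Sv is the subset of S for which feature A takes value v.

● Entropy(S) = -(9/14 * log2(9/14) + 5/14 * log2(5/14)) ≈ 0.940
● Gain(Feature 1) ≈ 0.940 - 0.694 = 0.246
● Gain(Feature 2) ≈ 0.940 - 0.911 = 0.029

Feature 1 (Weather) yields the larger Information Gain, consistent with the Gini-based comparison.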
Pruning methods to prevent overfitting.

Pruning methods in decision trees are essential for preventing overfitting, which occurs when the model captures noise in the
training data and fails to generalize well to unseen data.

Pruning techniques aim to simplify the tree by removing parts of it that do not provide significant predictive power or may be
capturing noise. Here are some common pruning methods:

Pre-pruning Techniques

1. Maximum Depth Limitation
2. Minimum Samples per Leaf
3. Minimum Information Gain

Post-pruning Techniques

1. Reduced Error Pruning
2. Cost-Complexity Pruning (CCP)
3. Subtree Replacement

Example and Code Walkthrough
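The deck leaves this slot for a live walkthrough; below is a minimal, self-contained sketch with scikit-learn (the dataset and hyper-parameter values are illustrative, not taken from the slides):

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Illustrative dataset; any labelled tabular data works the same way.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# criterion selects the splitting measure discussed above ("gini" or "entropy");
# max_depth and min_samples_leaf are pre-pruning controls; ccp_alpha enables
# cost-complexity (post-)pruning.
tree = DecisionTreeClassifier(criterion="entropy", max_depth=4,
                              min_samples_leaf=5, ccp_alpha=0.01, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", accuracy_score(y_test, tree.predict(X_test)))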

Bagging vs Boosting

Bagging (Bootstrap Aggregating)

Idea: Bagging is an ensemble method that aims to reduce the variance of a base learning algorithm by training multiple models on
different subsets of the training data and then combining their predictions.

● Training Process:
○ Multiple subsets of the training data are created by sampling with replacement (bootstrap samples).
○ A base learning algorithm (e.g., decision trees) is trained on each subset independently.
● Combination:
○ The final prediction is typically made by averaging the predictions of all the models (for regression) or taking a majority
vote (for classification).
● Advantages:
○ Reduces overfitting by averaging multiple models trained on different data subsets.
○ Works well with unstable models prone to variance (e.g., decision trees).
○ Can be parallelized, as models are trained independently.
● Example: The Random Forest algorithm is a popular implementation of Bagging, where decision trees are trained on bootstrap samples and their predictions are combined by majority vote (classification) or averaging (regression); see the sketch below.
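A minimal BaggingClassifier sketch in scikit-learn (synthetic data and settings are illustrative; the base estimator defaults to a decision tree):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

# Synthetic data for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)

# Each of the 50 base trees is trained on a bootstrap sample of the rows;
# predictions are combined by majority vote.
bagging = BaggingClassifier(n_estimators=50, random_state=42)
print("CV accuracy:", cross_val_score(bagging, X, y, cv=5).mean())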

Boosting
Idea: Boosting is an ensemble method that iteratively improves the performance of a weak learner (base model) by focusing on the
instances that were previously misclassified.

Training Process:

● Initially, each instance in the training data is given equal weight.


● A weak learner (e.g., decision tree) is trained on the weighted training data.
● Instances that were misclassified receive higher weights, and the next weak learner is trained on the re-weighted data.
● This process is repeated iteratively, with each new model focusing more on the previously misclassified instances.

Combination:

● Predictions are made by combining the predictions of all weak learners using a weighted sum, where the weights are typically
based on the performance of each learner.

Advantages:

● Can achieve high predictive accuracy by focusing on difficult instances in the training data.
● Works well with simple, high-bias weak learners (e.g., shallow decision trees).
● Often generalizes well in practice, although, unlike Bagging, it can overfit when run for too many iterations or on noisy data.

Example: AdaBoost (Adaptive Boosting) and Gradient Boosting are popular implementations of Boosting, where decision trees are commonly used as weak learners (see the sketch below).
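A minimal AdaBoost sketch in scikit-learn (illustrative data and settings; the default weak learner is a one-level decision tree, i.e. a stump):

from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

# Each round re-weights the training instances so that the next weak learner
# focuses on the examples the current ensemble misclassifies.
boost = AdaBoostClassifier(n_estimators=100, learning_rate=0.5, random_state=0)
print("CV accuracy:", cross_val_score(boost, X, y, cv=5).mean())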
Main Difference:
Bagging aims to reduce variance (overfitting), while Boosting focuses on reducing bias
(underfitting) by improving the performance of a weak learner.

Independence:
In Bagging, models are trained independently, while in Boosting, models are trained sequentially,
with each new model correcting the errors made by the previous ones.

Combination Method:
Bagging typically combines predictions by averaging (or majority vote for classification), while Boosting combines predictions using a weighted sum of its weak learners.

Robustness:
Bagging is generally more robust to noise and overfitting because it averages independently trained models; Boosting can achieve lower bias and higher accuracy, but it is more sensitive to noisy data and outliers and can overfit if over-trained.
Random Forest

Random Forest is a powerful ensemble learning method that combines the principles of bagging and
random feature selection to create a robust predictive model.

It builds multiple decision trees during training and outputs the class that is the mode of the classes
(classification) or mean prediction (regression) of the individual trees.

Example and Code Walkthrough
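Again left as a live walkthrough in the deck; a minimal scikit-learn sketch (illustrative dataset and hyper-parameter values):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

X, y = make_classification(n_samples=2000, n_features=25, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Each tree sees a bootstrap sample of the rows and a random subset of the
# features at every split (max_features); class predictions are majority-voted.
forest = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                                n_jobs=-1, random_state=42)
forest.fit(X_train, y_train)
print(classification_report(y_test, forest.predict(X_test)))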

Gradient Boosted Trees

Gradient Boosted Trees (GBT) is a powerful ensemble learning technique that builds a series of
decision trees in a sequential manner, where each subsequent tree corrects the errors of the previous
ones.
Gradient boosting uses gradient descent optimization to minimize a loss function, typically the mean
squared error (MSE) for regression tasks and the log-loss or deviance for classification tasks.

Variants of Gradient Boosted Trees:
XGBoost (Extreme Gradient Boosting):

● XGBoost is a scalable and efficient implementation of gradient boosting designed for speed and performance.
● It introduces several enhancements over traditional gradient boosting, including:
○ Regularized boosting (L1 and L2 regularization) to control model complexity and prevent overfitting.
○ Approximate tree learning algorithm to speed up training by reducing computation.
○ Built-in support for handling missing values in the dataset.
○ Parallel and distributed computing capabilities for efficient training on large datasets.

LightGBM (Light Gradient Boosting Machine):

● LightGBM is a gradient boosting framework developed by Microsoft that focuses on speed and memory efficiency.
● It employs a novel gradient-based one-side sampling (GOSS) technique and exclusive feature bundling (EFB) to reduce memory usage and speed up
training.
● LightGBM also supports parallel and distributed computing and offers GPU acceleration for faster training on large-scale datasets.
● It provides state-of-the-art performance in terms of training speed and predictive accuracy and is widely used in production systems.

CatBoost (Categorical Boosting):

● CatBoost is a gradient boosting library developed by Yandex, designed specifically to handle categorical features effectively.
● It automatically handles categorical variables without the need for pre-processing such as one-hot encoding or label encoding.
● CatBoost uses an efficient implementation of ordered boosting, a permutation-driven scheme that reduces target leakage and prediction shift, and supports various loss functions, including logarithmic loss for classification and RMSE for regression.
● CatBoost is known for its ease of use and strong performance in competitions and real-world applications.
Example and Code Walkthrough
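A minimal gradient boosting sketch using scikit-learn's GradientBoostingClassifier (illustrative data and settings; XGBoost, LightGBM, and CatBoost expose a very similar fit/predict interface):

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=2000, n_features=25, random_state=7)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=7)

# Each shallow tree is fitted to the negative gradient (residuals) of the loss;
# learning_rate scales how much each new tree contributes.
gbt = GradientBoostingClassifier(n_estimators=300, learning_rate=0.05,
                                 max_depth=3, random_state=7)
gbt.fit(X_train, y_train)
print("Test ROC-AUC:", roc_auc_score(y_test, gbt.predict_proba(X_test)[:, 1]))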

Class Imbalance

Class imbalance refers to the situation in a classification problem where the distribution of
class labels in the dataset is skewed, meaning one class (the minority class) is significantly
less represented than the other class or classes (the majority class or classes).

Handling Class Imbalance
Resampling Techniques:

● Undersampling: randomly remove examples from the majority class.
● Oversampling:
  a. Random Oversampling
  b. SMOTE (Synthetic Minority Over-sampling Technique)
  c. ADASYN (Adaptive Synthetic Sampling)
  SMOTE and ADASYN generate synthetic minority-class samples to balance the classes (see the sketch after this slide).
● Combined Sampling: Combine undersampling and oversampling techniques to balance the dataset more effectively.

Algorithmic Approaches:

● Cost-sensitive Learning
● Ensemble Methods: Use ensemble methods like Random Forest, Gradient Boosted Trees, or AdaBoost
● Algorithmic Modifications: Some algorithms have built-in mechanisms to handle class imbalance, such as class-weighted support in SVM
or the "class_weight" parameter in Scikit-learn's classifiers.

Evaluation Metrics:

● Use evaluation metrics that are more robust to class imbalance, such as precision, recall, F1-score, area under the ROC curve (AUC-ROC),
or area under the precision-recall curve (AUC-PR).
● Avoid using accuracy as the sole metric, as it can be misleading in imbalanced datasets.

Data Preprocessing:

● Feature Engineering: Create new features or transformations that help the algorithm better discriminate between classes.
● Dimensionality Reduction: Reduce the dimensionality of the dataset to focus on the most informative features and reduce noise.
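A minimal sketch of two of these options: SMOTE oversampling with the imbalanced-learn library and cost-sensitive class weighting in scikit-learn (the synthetic data and settings are illustrative):

from collections import Counter
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

# Roughly 95% / 5% class split for illustration.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=42)
print("Before:", Counter(y))

# Option 1: synthesize new minority-class samples with SMOTE.
X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print("After SMOTE:", Counter(y_res))

# Option 2: keep the data as-is and re-weight the loss instead (cost-sensitive learning).
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)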
Hyper-Parameter Tuning

Hyper-parameters are parameters that are set before the learning process begins and govern the
training process of a model.

Unlike model parameters (which are learned from the training data), hyper-parameters are not
directly learned from the data and must be specified before training.

Examples of Hyper-Parameters:

Decision Trees

● Max Depth: The maximum depth of the tree.


● Min Samples Split: The minimum number of samples required to split an internal node.
● Min Samples Leaf: The minimum number of samples required to be at a leaf node.
● Max Features: The number of features to consider when looking for the best split.
● Criterion: The function to measure the quality of a split (e.g., Gini impurity or entropy).

Methods of Hyper-Parameter Tuning

1. Grid Search

● Description: Grid Search is an exhaustive search method where a predefined set of hyper-parameters is specified, and the
model is trained and evaluated for each possible combination of these hyper-parameters.

2. Random Search

● Description: Random Search is a more efficient alternative to Grid Search.


● Instead of exhaustively searching through all combinations, it samples a fixed number of hyper-parameter combinations
randomly.

3. Bayesian Optimization

● Description: Bayesian Optimization is an advanced hyper-parameter tuning method that builds a probabilistic model of the
objective function and uses it to select the most promising hyper-parameters to evaluate next.
● Process:
○ Define the objective function to optimize (e.g., validation accuracy).
○ Construct a surrogate probabilistic model (e.g., Gaussian Process) of the objective function.
○ Use an acquisition function to determine the next set of hyper-parameters to evaluate, balancing exploration and
exploitation.
○ Update the surrogate model with the new data.
○ Repeat the acquisition and surrogate-update steps until convergence or a stopping criterion is met.
Example and Code Walkthrough
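A minimal sketch of Grid Search and Random Search with scikit-learn on a random forest (the parameter grid and data are illustrative; Bayesian optimization would typically use a separate library such as Optuna or scikit-optimize):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

param_grid = {
    "n_estimators": [100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 5],
}

# Grid Search: exhaustively evaluates every combination with 5-fold CV.
grid = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=5)
grid.fit(X, y)
print("Grid Search best:", grid.best_params_, round(grid.best_score_, 3))

# Random Search: samples a fixed number of combinations from the same space.
rand = RandomizedSearchCV(RandomForestClassifier(random_state=0), param_grid,
                          n_iter=5, cv=5, random_state=0)
rand.fit(X, y)
print("Random Search best:", rand.best_params_, round(rand.best_score_, 3))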
