3.Unit 3 ML Part-1 Q&A


Unit -3

Ensemble Learning and Random Forests: Introduction, Voting Classifiers, Bagging and
Pasting, Random Forests, Boosting, Stacking.

Support Vector Machine: Linear SVM Classification, Nonlinear SVM Classification, SVM Regression, Naïve Bayes Classifiers.

Ensemble Learning and Random Forests: Voting Classifiers

1. What is the difference between hard and soft voting classifiers? Explain them. [7M]
July – 2023 Set -1[Remember]

VOTING CLASSIFIERS
A Voting Classifier is a machine learning model that trains on an ensemble of numerous models and predicts an output (class) based on the class most likely to be chosen as the output.
It simply aggregates the findings of each classifier passed into the Voting Classifier and predicts the output class based on the majority of votes. The idea is that instead of creating separate dedicated models and finding the accuracy of each of them, we create a single model which trains on these models and predicts the output based on their combined majority of votes for each output class.
The Voting Classifier supports two types of voting.

1) Hard Voting

Hard voting, also known as majority voting, involves the simple act of tallying up the
predictions made by each base model and selecting the class that receives the most votes as the
final prediction. It is suitable for classification tasks, where the classes are discrete and
mutually exclusive.

Let’s discuss this with an example.

Suppose we want to determine whether an email is spam or not. This is a binary classification problem, and we have two classes: "spam" (negative class) and "Not spam" (positive class).
Now let's say we can determine these classes using different algorithms. Here, let's assume we have used 5 different models to classify the email, and they gave us the below probabilities:

Model 1: spam = 0.56, Not spam = 0.44
Model 2: spam = 0.49, Not spam = 0.51
Model 3: spam = 0.60, Not spam = 0.40
Model 4: spam = 0.45, Not spam = 0.55
Model 5: spam = 0.39, Not spam = 0.61

Here, 3 classifiers out of 5 voted for the email being "Not spam", while 2 out of 5 voted for it being "spam".

According to the hard voting rule, the class that receives the most votes is the final prediction. So, after passing the above data to a hard voting mechanism, it will predict the email as "Not spam".

Diagram of the above example (Hard Voting).


2) Soft Voting

Soft voting, also known as weighted voting, takes into account the probability scores of each
base model for each class and calculates the weighted average of these probabilities to make
the final prediction. To combine the predictions, soft voting calculates the average probability
of each class and then declares the winner having the highest weighted probability. It is
suitable for both classification and regression tasks.

Now, Let’s apply the soft voting mechanism to the above email classification problem.

To perform soft voting, we will take the average probability scores for each class across all
models and then select the class with the higher average probability as the final prediction.
Here’s how the calculation looks:

Average Probability for “spam” class: (0.56 + 0.49 + 0.6 + .45 + .39) / 5 = 0.498

Average Probability for “Not spam” class: (0.44 + 0.51 + 0.4 + .55 + .61) / 5 = 0.502

Since the average probability for “Not spam” (0.502) is greater than the average probability
for “spam” (0.498), the soft voting ensemble would predict the email as “Not spam.” In this
example, soft voting allows us to leverage the probability scores from each base model and
make a more nuanced decision based on the collective wisdom of all models. It takes into
account the confidence level of each model and provides a more refined prediction, which can
lead to better overall performance and accuracy in real-world scenarios.

Example of Soft Voting


Major Difference Between Hard Voting and Soft Voting

The primary distinction between hard voting and soft voting lies in the way they combine the
predictions:

 Hard Voting: It takes a simple majority vote to decide the final prediction, based on the
most frequent class predicted by individual models.

 Soft Voting: It considers the probability scores of each class predicted by individual
models and averages them to produce a more refined final prediction. When dealing with
imbalanced datasets, soft voting can help mitigate the bias towards the majority class by
taking into account the probabilities of all classes.
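
The two voting modes can be tried directly with scikit-learn's VotingClassifier. The sketch below is a minimal illustration on a synthetic dataset; the base classifiers, dataset, and parameters are illustrative assumptions, not the models from the email example above.

from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Toy binary classification data (illustrative only)
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

estimators = [
    ('lr', LogisticRegression(max_iter=1000)),
    ('dt', DecisionTreeClassifier(max_depth=4, random_state=42)),
    ('svc', SVC(probability=True, random_state=42)),  # probability=True is required for soft voting
]

# Hard voting: majority vote over the predicted classes
hard_clf = VotingClassifier(estimators=estimators, voting='hard').fit(X_train, y_train)

# Soft voting: average of the predicted class probabilities
soft_clf = VotingClassifier(estimators=estimators, voting='soft').fit(X_train, y_train)

print('Hard voting accuracy:', hard_clf.score(X_test, y_test))
print('Soft voting accuracy:', soft_clf.score(X_test, y_test))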

2. Explain about Linear SVM Classification in detail. Compare it with non-linear model.
[7M] July – 2023 Set -1[Understand]

Support vector machines are a set of supervised learning methods used for classification, regression, and outlier detection. All of these are common tasks in machine learning.

You can use them to detect cancerous cells based on millions of images or you can use them to
predict future driving routes with a well-fitted regression model.

There are specific types of SVMs you can use for particular machine learning problems, like
support vector regression (SVR) which is an extension of support vector classification (SVC).

The main thing to keep in mind here is that these are just math equations tuned to give you the
most accurate answer possible as quickly as possible.

SVMs are different from other classification algorithms because of the way they choose the
decision boundary that maximizes the distance from the nearest data points of all the classes.
The decision boundary created by SVMs is called the maximum margin classifier or the maximum margin hyperplane.

How an SVM works

A simple linear SVM classifier works by making a straight line between two classes. That
means all of the data points on one side of the line will represent a category and the data points
on the other side of the line will be put into a different category. This means there can be an
infinite number of lines to choose from.

What makes the linear SVM algorithm better than some other algorithms, like k-nearest neighbors, is that it chooses the best line to classify your data points: the line that separates the data and is as far away from the closest data points as possible.

A 2-D example helps to make sense of all the machine learning jargon. Basically you have
some data points on a grid. You're trying to separate these data points by the category they
should fit in, but you don't want to have any data in the wrong category. That means you're
trying to find the line between the two closest points that keeps the other data points separated.

So the two closest data points give you the support vectors you'll use to find that line. That line
is called the decision boundary.

linear SVM

The decision boundary doesn't have to be a line. It's also referred to as a hyperplane because
you can find the decision boundary with any number of features, not just two.
non-linear SVM using RBF kernel

Types of SVMs

There are two different types of SVMs, each used for different things:

 Simple SVM: Typically used for linear regression and classification problems.

 Kernel SVM: Has more flexibility for non-linear data because you can add more features
to fit a hyperplane instead of a two-dimensional space.

Why SVMs are used in machine learning

SVMs are used in applications like handwriting recognition, intrusion detection, face detection, email classification, gene classification, and web page classification. This is one of the reasons we use SVMs in machine learning: they can handle both classification and regression on linear and non-linear data.

Another reason we use SVMs is because they can find complex relationships between your
data without you needing to do a lot of transformations on your own. It's a great option when
you are working with smaller datasets that have tens to hundreds of thousands of features.
They typically find more accurate results when compared to other algorithms because of their
ability to handle small, complex datasets.
Here are some of the pros and cons for using SVMs.

Pros

 Effective on datasets with multiple features, like financial or medical data.

 Effective in cases where number of features is greater than the number of data points.

 Uses a subset of training points in the decision function called support vectors which
makes it memory efficient.

 Different kernel functions can be specified for the decision function. You can use
common kernels, but it's also possible to specify custom kernels.

Cons

 If the number of features is a lot bigger than the number of data points, avoiding over-
fitting when choosing kernel functions and regularization term is crucial.

 SVMs don't directly provide probability estimates. Those are calculated using an
expensive five-fold cross-validation.

 Works best on small sample sets because of its high training time.

Since SVMs can use any number of kernels, it's important that you know about a few of them.

Linear SVM vs Non-Linear SVM

Linear SVM:
 The data can be easily separated with a linear (straight) line.
 Data is classified with the help of a hyperplane.
 Data can be easily classified by drawing a straight line.

Non-Linear SVM:
 The data cannot be easily separated with a linear line.
 We use kernels to turn non-separable data into separable data.
 We map the data into a high-dimensional space to classify it.
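
The contrast can be seen with a minimal scikit-learn sketch on a dataset that is not linearly separable; the make_moons dataset and the hyperparameters are illustrative assumptions.

from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Two interleaving half-moons: not separable by a straight line
X, y = make_moons(n_samples=500, noise=0.2, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Linear SVM: a single straight decision boundary
linear_svm = SVC(kernel='linear', C=1.0).fit(X_train, y_train)

# Non-linear SVM: the RBF kernel implicitly maps the data to a higher-dimensional space
rbf_svm = SVC(kernel='rbf', C=1.0, gamma='scale').fit(X_train, y_train)

print('Linear SVM accuracy:', linear_svm.score(X_test, y_test))
print('RBF SVM accuracy:', rbf_svm.score(X_test, y_test))

On this kind of data, the RBF-kernel SVM typically reaches a noticeably higher accuracy than the linear SVM.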

Applications of SVM

Sentiment analysis, Spam Detection, Handwritten digit recognition, Image recognition


Bagging and Pasting

3. What is Bagging and Pasting? Explain its implementation with scikit-learn. [7M]
July – 2023 Set -3[Remember]

In machine learning, sometimes multiple predictors grouped together have better predictive performance than any one of the group alone. These techniques are very popular in competitions and in production, and they are called Ensemble Learning.

There are several ways to group models. They differ in the training algorithm and the data used in each one of them, and also in how they are grouped. Here we'll talk about two methods called Bagging and Pasting and how to implement them in scikit-learn. But before we begin talking about Bagging and Pasting, we have to know what Bootstrapping is.

Bootstrapping

In statistics, bootstrapping refers to a resampling method that consists of repeatedly drawing samples, with replacement, from the data to form other smaller datasets, called bootstrap samples. It's as if the bootstrapping method is making a bunch of simulations of our original dataset, so in some cases we can generalize the mean and the standard deviation.

For example, let’s say we have a set of observations: [2, 4, 32, 8, 16]. If we want each
bootstrap sample containing n observations, the following are valid samples:

 n=3: [32, 4, 4], [8, 16, 2], [2, 2, 2]…

 n=4: [2, 32, 4, 16], [2, 4, 2, 8], [8, 32, 4, 2]…

Since we draw data with replacement, an observation can appear more than once in a single sample.
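
A quick sketch of how bootstrap samples can be drawn with NumPy from the observation set above; the sample size used here is an illustrative choice.

import numpy as np

rng = np.random.default_rng(42)
data = np.array([2, 4, 32, 8, 16])

# Draw 3 bootstrap samples of size n=4, sampling with replacement
for _ in range(3):
    print(rng.choice(data, size=4, replace=True))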

Bagging & Pasting

Bagging means bootstrap + aggregating, and it is an ensemble method in which we first bootstrap our data and, for each bootstrap sample, train one model. After that, we aggregate the models with equal weights. When sampling is done without replacement, the method is called pasting.
Out-of-Bag Scoring

If we are using bagging, there's a chance that a given sample is never selected, while others may be selected multiple times. The probability of not selecting a specific sample in a single draw is (1 – 1/n), where n is the number of samples. Therefore, the probability of a specific sample not being picked in n draws is (1 – 1/n)^n. When n is large, this probability approaches 1/e, which is approximately 0.3679. This means that when the dataset is big enough, about 37% of its samples are never selected, and we can use them to test our model. This is called Out-of-Bag scoring, or OOB Scoring.
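
A quick numerical check of this approximation:

import math

for n in [10, 100, 1000, 10000]:
    print(n, (1 - 1 / n) ** n)  # approaches 1/e as n grows

print('1/e =', 1 / math.e)  # ~0.3679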

Random Forests

As the name suggests, a random forest is an ensemble of decision trees that can be used for classification or regression. In most cases it uses bagging. Each tree in the forest outputs a prediction, and the most voted prediction becomes the output of the model. This helps to make the model more accurate and stable, preventing overfitting.

Another very useful property of random forests is the ability to measure the relative importance of each feature by calculating how much each one reduces the impurity of the model. This is called feature importance.

A scikit-learn Example

To see how bagging works in scikit-learn, we will train some models alone and then aggregate
them, so we can see if it works.

In this example we'll be using the 1994 census dataset on US income. It contains information such as marital status, age, type of work, and more. As the target column we have a categorical value that indicates whether a salary is less than or equal to 50k a year (0) or not (1). Let's explore the DataFrame with Pandas' info method:

RangeIndex: 32561 entries, 0 to 32560
Data columns (total 15 columns):
age               32561 non-null int64
workclass         32561 non-null object
fnlwgt            32561 non-null int64
education         32561 non-null object
education_num     32561 non-null int64
marital_status    32561 non-null object
occupation        32561 non-null object
relationship      32561 non-null object
race              32561 non-null object
sex               32561 non-null object
capital_gain      32561 non-null int64
capital_loss      32561 non-null int64
hours_per_week    32561 non-null int64
native_country    32561 non-null object
high_income       32561 non-null int8
dtypes: int64(6), int8(1), object(8)

As we can see, there are numerical (int64 and int8) and categorical (object) data types. We have to deal with each type separately before sending the data to the predictor.

Data Preparation

First we load the CSV file and convert the target column to categorical, so when we are
passing all columns to the pipeline we don’t have to worry about the target column.

import numpy as np
import pandas as pd

# Load CSV
df = pd.read_csv('data/income.csv')

# Convert target to categorical codes
col = pd.Categorical(df.high_income)
df["high_income"] = col.codes

There are numerical and categorical columns in our dataset, and we need to apply different preprocessing to each of them. The numerical features need to be normalized and the categorical features need to be converted to integers. To do this, we define a transformer that preprocesses our data depending on its type.

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import MinMaxScaler

class PreprocessTransformer(BaseEstimator, TransformerMixin):
    def __init__(self, cat_features, num_features):
        self.cat_features = cat_features
        self.num_features = num_features

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        df = X.copy()
        # Treat '?' workclass as unknown
        df.loc[df['workclass'] == '?', 'workclass'] = 'Unknown'
        # Too many categories, just convert to US and non-US
        df.loc[df['native_country'] != 'United-States', 'native_country'] = 'non_usa'
        # Convert categorical columns to integer codes
        for name in self.cat_features:
            col = pd.Categorical(df[name])
            df[name] = col.codes
        # Normalize numerical features
        scaler = MinMaxScaler()
        df[self.num_features] = scaler.fit_transform(df[self.num_features])
        return df

The data is then split into training and test sets, so we can check later whether our model generalizes to unseen data.
from sklearn.model_selection import train_test_split

# Split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(
    df.drop('high_income', axis=1),
    df['high_income'],
    test_size=0.2,
    random_state=42,
    shuffle=True,
    stratify=df['high_income']
)

Build the Model

Finally, we build our models. First we create a pipeline to preprocess with our custom
transformer, select the best features with SelectKBest and train our predictors.

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import BaggingClassifier

random_state = 42
leaf_nodes = 5
num_features = 10
num_estimators = 100

# Decision tree used as the base estimator for bagging
tree_clf = DecisionTreeClassifier(
    splitter='random',
    max_leaf_nodes=leaf_nodes,
    random_state=random_state
)

# Initialize the bagging classifier
bag_clf = BaggingClassifier(
    tree_clf,
    n_estimators=num_estimators,
    max_samples=1.0,
    max_features=1.0,
    random_state=random_state,
    n_jobs=-1
)

# Create a pipeline (categorical_features and numerical_features are the
# lists of column names prepared earlier for the custom transformer)
pipe = Pipeline([
    ('preproc', PreprocessTransformer(categorical_features, numerical_features)),
    ('fs', SelectKBest()),
    ('clf', DecisionTreeClassifier())
])

Since what we are trying to do is to see the difference between a simple decision tree and an ensemble of them, we can use scikit-learn's GridSearchCV to train all predictors with a single fit call. We use AUC and accuracy as scoring metrics and a KFold with 10 splits for cross-validation.

from sklearn.model_selection import KFold, GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, make_scorer

# Define our search space for grid search
search_space = [
    {
        'clf': [DecisionTreeClassifier()],
        'clf__max_leaf_nodes': [128],
        'fs__score_func': [chi2],
        'fs__k': [10],
    },
    {
        'clf': [RandomForestClassifier()],
        'clf__n_estimators': [200],
        'clf__max_leaf_nodes': [128],
        'clf__bootstrap': [False, True],
        'fs__score_func': [chi2],
        'fs__k': [10],
    }
]

# Define scoring
scoring = {'AUC': 'roc_auc', 'Accuracy': make_scorer(accuracy_score)}

# Define cross-validation (shuffle=True is needed for random_state to take effect)
kfold = KFold(n_splits=10, shuffle=True, random_state=42)

# Define grid search
grid = GridSearchCV(
    pipe,
    param_grid=search_space,
    cv=kfold,
    scoring=scoring,
    refit='AUC',
    verbose=1,
    n_jobs=-1
)

# Fit grid search
model = grid.fit(X_train, y_train)

The mean AUC and accuracy for each model tested in GridSearchCV are:

 Single model: AUC = 0.791, Accuracy: 0.798

 Bagging: AUC = 0.869, Accuracy = 0.816

 Pasting: AUC = 0.870, Accuracy = 0.815

 Native random forest: AUC = 0.887, Accuracy = 0.838

As expected, we get better results with the ensemble methods, even though their constituent parts use the same training algorithm and the same parameters as the single model.

Since the best estimator was the random forest, we can inspect the OOB score and the feature importances with:

best_estimator = grid.best_estimator_.steps[-1][1]
columns = X_test.columns.tolist()

# oob_score_ is only available when the forest was fit with oob_score=True
print('OOB Score: {}'.format(best_estimator.oob_score_))
print('Feature Importances')
for i, imp in enumerate(best_estimator.feature_importances_):
    print('{}: {:.3f}'.format(columns[i], imp))

Which prints:

OOB Score: 0.8396805896805897

Feature Importances:
age: 0.048
workclass: 0.012
fnlwgt: 0.167
education: 0.138
education_num: 0.001
marital_status: 0.329
occupation: 0.009
relationship: 0.259
race: 0.012
sex: 0.025

4. What is Bagging technique? Explain about Random Forest Algorithm. [7M] July –
2023[Remember]

-Refer 3rd Question for Bagging Technique

Random Forest Algorithm

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in ML. It
is based on the concept of ensemble learning, which is a process of combining multiple
classifiers to solve a complex problem and to improve the performance of the model.

As the name suggests, "Random Forest is a classifier that contains a number of decision trees on various subsets of the given dataset and takes the average to improve the predictive accuracy of that dataset." Instead of relying on one decision tree, the random forest takes the prediction from each tree and, based on the majority vote of those predictions, predicts the final output.

The below diagram explains the working of the Random Forest algorithm.

USES OF RANDOM FOREST:

o It takes less training time as compared to other algorithms.


o It predicts output with high accuracy, and it runs efficiently even for large datasets.
o It can also maintain accuracy when a large proportion of data is missing.

How does Random Forest algorithm work?

Random Forest works in two phases: the first is to create the random forest by combining N decision trees, and the second is to make predictions using each tree created in the first phase.

The Working process can be explained in the below steps and diagram:

Step-1: Select random K data points from the training set.

Step-2: Build the decision trees associated with the selected data points (Subsets).

Step-3: Choose the number N for decision trees that you want to build.

Step-4: Repeat Step 1 & 2.

Step-5: For new data points, find the predictions of each decision tree, and assign the new
data points to the category that wins the majority votes.

The working of the algorithm can be better understood by the below example:

Example: Suppose there is a dataset that contains multiple fruit images. So, this dataset is
given to the Random forest classifier. The dataset is divided into subsets and given to each
decision tree. During the training phase, each decision tree produces a prediction result, and
when a new data point occurs, then based on the majority of results, the Random Forest
classifier predicts the final decision.

Consider the below image:
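
A minimal scikit-learn sketch of this workflow is shown below; the Iris dataset and the hyperparameters are illustrative assumptions, not tied to the fruit example.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# n_estimators is the number N of decision trees; each tree is trained on a
# bootstrap sample of the training set (Steps 1-4 above)
forest = RandomForestClassifier(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)

# Step 5: each tree votes and the majority class becomes the prediction
print('Accuracy:', forest.score(X_test, y_test))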


Applications of Random Forest

There are mainly four sectors where Random Forest is mostly used:

I. Banking: Banking sector mostly uses this algorithm for the identification of loan risk.
II. Medicine: With the help of this algorithm, disease trends and risks of the disease can
be identified.
III. Land Use: We can identify the areas of similar land use by this algorithm.
IV. Marketing: Marketing trends can be identified using this algorithm.

Advantages of Random Forest


o Random Forest is capable of performing both Classification and Regression tasks.
o It is capable of handling large datasets with high dimensionality.
o It enhances the accuracy of the model and prevents the overfitting issue.

Disadvantages of Random Forest


o Although Random Forest can be used for both classification and regression tasks, it is less suitable for regression tasks.

5. What are the benefits of out-of-bag evaluation? Explain it. [7M] July – 2023 Set -2
[Remember]
Basically, it (bagging, as used in a random forest) is supervised learning based on the concept of creating independent base learners (multiple decision trees trained on bootstrapped samples from the original dataset) and training them. The bootstrapped samples are created by random sampling with replacement from the dataset d with n features, where each sample d` is smaller than d and n` < n. After training is completed, to make a prediction, all the models in the ensemble are polled and their results are averaged.

A variety of decision trees gives the algorithm power to generalize itself, rather than
memorizing the data like a toddler, leading to the robustness of the entire model, by reducing
overfitting by a significant amount!
Let us visualize the random sampling with replacement (bootstrapping) process-

Let us suppose we have a dataset containing the names of various data structures and algorithms. For the sake of understanding, assume that we have to categorize these as "data structure" or "algorithm". Notice that some of the observations (marked with a "." in the original dataset) have been replaced with some of the ticked (✅) observations.

The replaced ones are the ones we did not put in our bag (sample dataset). Hence, those are
simply called Out of Bag observations/data points.

The Out of Bag observations/data points constitute a very important part of the information
(dataset) that, if skipped during model training, may lead to vulnerable predictions!

Thankfully, we make a very good use of the Out Of Bag Data!

The Out of Bag data will now be treated as a validation split for the random forest model. So, for every bootstrapped sample, the decision tree trained on that particular sample will calculate prediction values for the OOB (left-out) points. These values are known as the OOB Score. The OOB score, in a nutshell, is the validation of the OOB data.

How Does This OOB-Score Help?

By now, you must have understood that these multiple base learners (aka decision trees) calculate their respective OOB scores. Every OOB point is passed through every base learner that was not trained on it, and a prediction is calculated for each row. Finally, the OOB score is calculated as the proportion of correctly predicted rows from the out-of-bag validation set.

A very important point to note here is that with normal cross-validation techniques there is a higher chance of data leakage, since many of the data points have already been worked on during training (so we may assume that they have been seen, and hence it is CHEATING!!).

However, in OOB Evaluation, none of the OOB samples is seen while the decision trees/
base learners are trained. This is incredibly helpful in avoiding that unwanted data leakage
and lowering the variance, thereby reducing overfitting.

Never memorize anything, otherwise, you will overfit yourself too!

As clever as it may look, OOB evaluation has its own ups and downs.

To enable OOB evaluation in your Random Forest algorithm, just set scikit-learn's 'oob_score' parameter to True, and you are pretty much done! A very intuitive and smart way to check how good your OOB score is, is to simply look at the difference between your validation score and your OOB score. The smaller the difference, the better it is!
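
A minimal sketch of OOB evaluation in scikit-learn, comparing the OOB score with a held-out validation score; the synthetic dataset and parameters are illustrative assumptions.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=42)

# oob_score=True makes the forest evaluate each training sample using only
# the trees that did not see it during bootstrapping
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=42)
forest.fit(X_train, y_train)

print('OOB score:       ', forest.oob_score_)
print('Validation score:', forest.score(X_val, y_val))
# The smaller the gap between the two scores, the better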


Random Forests

6. Discuss about Extra trees. Are Extra-Trees slower or faster than regular Random
Forests? Explain. [7M] July – 2023 Set -2[Understand]

Extremely Randomized Trees, or Extra Trees for short, is an ensemble machine learning
algorithm based on decision trees. The Extra Trees algorithm works by creating a large
number of unpruned decision trees from the training dataset. Predictions are made by
averaging the prediction of the decision trees in the case of regression or using majority
voting in the case of classification. The predictions of the trees are aggregated to yield the
final prediction, by majority vote in classification problems and arithmetic average in
regression problems.

There are three main hyperparameters to tune in the algorithm; they are the number of
decision trees in the ensemble, the number of input features to randomly select and
consider for each split point, and the minimum number of samples required in a node to
create a new split point.

The random selection of split points makes the decision trees in the ensemble less
correlated, although this increases the variance of the algorithm. This increase in variance
can be countered by increasing the number of trees used in the ensemble.

Extra Trees vs Random Forest

The two ensembles have a lot in common. Both of them are composed of a large number of
decision trees, where the final decision is obtained taking into account the prediction of
every tree. Specifically, by majority vote in classification problems, and by the arithmetic
mean in regression problems.

The main differences between Extra Trees and Random Forest are the following:

 Random Forest uses bootstrap replicas, that is to say, it subsamples the input data with replacement, whereas Extra Trees uses the whole original dataset. In the Extra Trees sklearn implementation there is an optional parameter that allows users to bootstrap replicas, but by default it uses the entire input sample. This may increase variance, because bootstrapping makes the training data more diversified.

 Another difference is the selection of cut points used to split nodes. Random Forest chooses the optimum split, while Extra Trees chooses it randomly. However, once the split points are selected, the two algorithms choose the best one among the subset of features. Therefore, Extra Trees adds randomization but still has optimization.

These differences motivate the reduction of both bias and variance. On one hand, using
the whole original sample instead of a bootstrap replica will reduce bias. On the other
hand, choosing randomly the split point of each node will reduce variance.
In terms of computational cost, and therefore execution time, the Extra Trees algorithm is faster. It saves time because the whole procedure is the same, but it randomly chooses the split point rather than calculating the optimal one.
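
A minimal sketch comparing the two ensembles in scikit-learn, including a rough timing comparison; the synthetic dataset and parameters are illustrative assumptions.

import time

from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=30, random_state=42)

for Model in (RandomForestClassifier, ExtraTreesClassifier):
    clf = Model(n_estimators=200, random_state=42, n_jobs=-1)
    start = time.perf_counter()
    scores = cross_val_score(clf, X, y, cv=5)
    elapsed = time.perf_counter() - start
    print(f'{Model.__name__}: accuracy={scores.mean():.3f}, time={elapsed:.1f}s')

Because Extra Trees skips the search for the optimal split point, its training time is usually lower than Random Forest's for the same number of trees.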

Boosting

7. Define Boosting? Explain about Gradient Boosting technique. [7M] July – 2023 Set
- 3[Remember]

Boosting

Boosting is an ensemble method that enables each member to learn from the preceding
member's mistakes and make better predictions for the future. Unlike the bagging method, in
boosting, all base learners (weak) are arranged in a sequential format so that they can learn
from the mistakes of their preceding learner. Hence, in this way, all weak learners get turned
into strong learners and make a better predictive model with significantly improved
performance.

We now have a basic understanding of ensemble techniques in machine learning and their two common methods, i.e., bagging and boosting. Let's now look at boosting in more detail, starting with Gradient Boosting.

The Boosting Algorithm is one of the most powerful learning ideas introduced in the last twenty years. Gradient Boosting is a supervised machine learning algorithm used for classification and regression problems. It is an ensemble technique which uses multiple weak learners to produce a strong model.

Intuition
Gradient Boosting relies on the intuition that the best possible next model, when combined with the previous models, minimizes the overall prediction error. The key idea is to set the residual errors of the previous models as the target outcomes for the next model, in order to minimize those errors. It is one of several boosting algorithms (others include AdaBoost, XGBoost, etc.).

Input requirement for Gradient Boosting:

1. A Loss Function to optimize.

2. A weak learner to make prediction(Generally Decision tree).

3. An additive model to add weak learners to minimize the loss function.

1. Loss Function

The loss function basically tells how well the algorithm models the data set. In simple terms, it is the difference between the actual values and the predicted values.

Regression Loss functions:

1. L1 loss or Mean Absolute Errors (MAE)

2. L2 Loss or Mean Square Error(MSE)

3. Quadratic Loss

Binary Classification Loss Functions:


1. Binary Cross Entropy Loss

2. Hinge Loss

A gradient descent procedure is used to minimize the loss when adding trees.

2. Weak Learner

Weak learners are the models that are used sequentially to reduce the errors generated by the previous models and to return a strong model at the end.

Decision trees are used as weak learner in gradient boosting algorithm.

3. Additive Model

In gradient boosting, decision trees are added one at a time (in sequence), and existing trees in
the model are not changed.

Understanding Gradient Boosting Step by Step :

This is our data set. Here Age, Sft. and Location are the independent variables, and Price is the dependent variable or target variable.

Step 1: Calculate the average/mean of the target variable.


Step 2: Calculate the residuals for each sample.

Step 3: Construct a decision tree. We build a tree with the goal of predicting the residuals. In the event that there are more residuals than leaf nodes (here there are 6 residuals), some residuals will end up inside the same leaf. When this happens, we compute their average and place that average inside the leaf.

After this, the tree becomes like this.
Step 4: Predict the target label using all the trees within the ensemble.

Each sample passes through the decision nodes of the newly formed tree until it reaches a given leaf. The residual in that leaf is used to predict the house price.

Calculation above for the Residual values (-338) and (-208) from Step 2.

In the same way, we will calculate the Predicted Price for the other values.

Note: We have initially taken 0.1 as the learning rate.

Step 5: Compute the new residuals when the Price is 350 and 480, respectively.

With our single leaf with average value (688), we have the below column of residuals.

With our decision tree, we end up with the below new residuals.

Step 6: Repeat steps 3 to 5 until the number of iterations matches the number specified by the hyperparameter (number of estimators).

Step 7: Once trained, use all of the trees in the ensemble to make a final prediction for the value of the target variable. The final prediction will be equal to the mean we computed in Step 1, plus all the residuals predicted by the trees that make up the ensemble, multiplied by the learning rate.

Here,

LR : Learning Rate
DT: Decision Tree
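
A minimal scikit-learn sketch of gradient boosting for regression, mirroring the steps above (initial mean prediction, sequential trees fit to residuals, contributions scaled by a learning rate); the synthetic dataset and hyperparameters are illustrative assumptions.

from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=500, n_features=3, noise=10.0, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# learning_rate scales each tree's predicted residuals (Steps 4 and 7);
# n_estimators is the number of boosting iterations (Step 6)
gbr = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1, max_depth=3)
gbr.fit(X_train, y_train)

print('R^2 on the test set:', gbr.score(X_test, y_test))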

8. Explain about Ada Boosting technique. [7M] July – 2023 Set -1[Understand]

The following takes you through an intuitive explanation of the AdaBoost algorithm! AdaBoost is a Boosting algorithm that, like a Random Forest, is built from many small decision trees.

Deforestation! Oh no!

If you’re already familiar with Random Forests, you’ll recall that a Random Forest is full of
many different Decision Trees. AdaBoost is built using a similar concept, but instead of
trees, the algorithm uses stumps.

stump visualization (picture courtesy of Google)

A stump is a decision tree consisting of just a root node (one split) and its leaf nodes. And that's it!

Upon first glance, a forest of stumps doesn’t seem like it would end up being an accurate
classification methodology. And it’s completely correct that a single stump doesn’t do a
great job of classifying samples. It’s also true that a ton of independent stumps combined
probably wouldn’t do a great job of classifying a sample. That’s why AdaBoost combines
stump structures with a concept called Boosting.

The Boost in AdaBoost!


This article assumes you’re already familiar with the concept of Boosting. As a quick
reminder, Boosting is an ensemble learning technique where each model is built
sequentially by iterating over the previous model.

In the forest of stumps that makes up AdaBoost, each stump has a different amount of say
in the final classification decision for each sample. This is different from the Random
Forest algorithm, where each tree has an equal vote.

In AdaBoost, order is important. The errors that the first stump makes influence how the second stump is made, and so on and so forth, until as many errors as possible have been taken into account.

Let’s start building the first stump of a simplified, made up classification problem to better
understand how this works.

Building Our First Stump

We'll use our own weather data to simulate the AdaBoost algorithm. In this mini made-up dataset, the target variable is RainTomorrow, and it has two possible values, "Yes" and "No". The goal of our classification problem is to figure out if it will rain the next day based on temperature, humidity, and whether or not it rained on the day in question.

The first step of the algorithm is to assign each sample an equal weight. Since there are four
samples in our simplified example, each sample will be assigned the weight 1/4. After the
first stump is made, the weights will change in order to guide how the next stump is created.

Sample Weather Data (image is my own)


Then, we find the variable that does the best job of classifying the samples using the GINI Index. At this step, we can ignore the weights, since they're currently all the same.

Let's assume that Humidity gave the best (lowest) GINI Index, and choose it for the root node of our first stump. Say that it correctly classified every sample except for the one highlighted below.

Highlighted Sample Weather Data (image is my own)

In order to determine how much say this stump has in our final classification, we’ll need to
calculate Total Error and Amount of Say.

Let’s start with total error. The Total Error for a stump is the sum of the weights associated
with the incorrectly classified samples. Here, the total error is 1/4.

It’s important to note that because all of the Sample Weights add up to 1, Total Error will
always be between 0 (for a perfect stump) and 1 (for a horrible stump).

Next, we use the Total Error to calculate the Amount of Say a stump has in the final classification. This is calculated using the equation below:

Amount of Say = (1/2) * ln((1 - Total Error) / Total Error)

To better understand how this equation determines amount of say, we’ll visualize it.
Amount of Say Equation, visualization (image is my own)

After looking at the visualization above, we can see that when a stump does a good job, and
the total error is small, then the Amount of Say is a relatively large, positive value. And
when a stump does a terrible job, and the Total Error is close to 1 (meaning the stump
consistently gives you the opposite answer for classification), then the Amount of Say will
be a large, negative value.
Furthermore, when a stump is no better at classifying than random guessing, such as
flipping a coin, then the Amount of Say will be 0. Note: if the Total Error is 1 or 0, then the
Amount of Say equation is undefined.

We can calculate our Amount of Say using our Total Error of 1/4. This comes out to:

Amount of Say = (1/2) * ln((1 - 1/4) / (1/4)) = (1/2) * ln(3) ≈ 0.55

Now we have our Total Error and our Amount of Say, but we need to incorporate them into
our algorithm so that our next stump takes the errors of our current stump into account.
We’ll do this by using Total Error and Amount of Say to calculate new sample weights.

Calculating New Sample Weights

The main idea behind calculating new sample weights is that whichever samples the first
stump misclassifies get assigned an increased weight before the next stump is built. This
means that when the algorithm goes to make the second tree, it’ll know which samples are
most important to get right this time around.

Let's calculate our new weights. For calculating new sample weights, we'll need two separate equations. First, the formula for increasing the sample weight of the sample that was incorrectly classified:

New Sample Weight = Sample Weight * e^(Amount of Say)

We can once again visualize this to help us better understand how our equation works.
New Sample Weight for incorrectly classified samples, visualization (image is my own)

As you can see, when the Amount of Say is relatively large (i.e. the last stump did a good
job of classifying samples) then we scale the previous sample weight with a large number.
This makes the new sample weight much larger than the old one.

And when the amount of say is relatively low, then the previous sample weight is scaled by
a relatively small number. This means that the new sample weight will only be slightly
larger than the old one.

Let's calculate our new sample weight for our misclassified sample:

New Sample Weight = (1/4) * e^(0.55) ≈ 0.43

Our second formula is for decreasing the sample weights of the samples that were correctly classified:

New Sample Weight = Sample Weight * e^(-Amount of Say)

Let’s visualize this equation.

New Sample Weight for correctly classified sample, visualization (image is my own)
When the amount of say is relatively large, we scale the sample weight by a value very
close to zero. This makes the new sample weight very small.

If the amount of say for the last stump is relatively small, then we scale the sample weight
by a value close to 1. This makes the new sample weight only a little smaller than the old
one.

Let's calculate our new sample weights for our three correctly classified samples:

New Sample Weight = (1/4) * e^(-0.55) ≈ 0.14

Note that this value is less than 0.25, which was the initial sample weight for these samples.

Now that we have our new sample weights, we can put them into our data frame.

Sample Weather Data with New Sample Weights (image is my own)

There’s one more step we have to complete before moving on. We have to normalize the
new sample weights. As a reminder, this means adjusting each sample weight so that all
four sample weights add up to 1. This can be done by dividing each sample weight by the
sum of all new sample weights.
Sample Weather Data with New Sample Weights, normalized (image is my own)

And finally, we no longer have any need for our first sample weights. So, our new sample
weights become our current sample weights!

Sample Weather Data for second stump (image is my own)

Now that we have our sample weather data with sample weights based on the first stump,
we can build our next stump. (And so on, and so forth! This is AdaBoost!).
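
The weight-update arithmetic of this walkthrough can be reproduced with the standard AdaBoost formulas in a few lines of plain Python; which of the four samples is the misclassified one is an illustrative assumption.

import math

weights = [0.25, 0.25, 0.25, 0.25]             # initial sample weights (1/4 each)
misclassified = [False, True, False, False]    # the stump got exactly one sample wrong

# Total Error = sum of the weights of the misclassified samples
total_error = sum(w for w, bad in zip(weights, misclassified) if bad)

# Amount of Say = 1/2 * ln((1 - Total Error) / Total Error)
amount_of_say = 0.5 * math.log((1 - total_error) / total_error)
print('Total Error:', total_error)                 # 0.25
print('Amount of Say:', round(amount_of_say, 3))   # ~0.549

# Increase the weight of the misclassified sample, decrease the others
new_weights = [
    w * math.exp(amount_of_say) if bad else w * math.exp(-amount_of_say)
    for w, bad in zip(weights, misclassified)
]

# Normalize so the new weights sum to 1
total = sum(new_weights)
new_weights = [round(w / total, 3) for w in new_weights]
print('New sample weights:', new_weights)          # ~[0.167, 0.5, 0.167, 0.167]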

Setting Up Our Second Stump

To build our second stump, we have two options.

1. We can calculate a weighted Gini Index so that whichever variable does a good job
of classifying previously misclassified samples gets chosen as the root of the second
stump.

2. We can make a new collection of samples that contains duplicate copies of the
samples with the largest sample weights. This is done by making a new, empty
dataset. Then, for each row in the new, empty dataset, we would pick a random
number between 0 and 1 and see where that number falls when we use the sample
weights like a distribution. Then, we would fill the row accordingly. We would
continue this process until all rows in the new collection of samples are filled.

Once the method for setting up the second stump is chosen, this methodology will be used
for all remaining stumps.

Deciding Final Classification

You may be asking yourself — once all stumps have performed classification, how does
this forest of stumps decide on final classification for each sample? Here’s how it works for
one sample. We’ll assume the target variable had two possible answers, just like our sample
data — “Yes” and “No”.

First, we add up the amount of say for all stumps that classified “Yes”. Then, we do the
same for all stumps that classified “No”. Whichever sum is greater wins as the ultimate
classification of the sample.

This process is performed for all samples in the test dataset. Once this process is completed,
so is the AdaBoost classification.

Stacking
9. Illustrate the stacking mechanism in ensemble techniques. [7M] July – 2023 Set -
4[Apply]
Stacking is a strong ensemble learning strategy in machine learning that combines the predictions of numerous base models to obtain a final prediction with better performance. It is also known as stacked ensembles or stacked generalization. The rest of this answer discusses stacking in detail, addressing its concept, benefits, implementation, and best practices.

What exactly is stacking?

Stacking is a machine learning strategy that combines the predictions of numerous base models, also known as first-level models or base learners, to obtain a final prediction. It entails training numerous base models on the same training dataset, then feeding their predictions into a higher-level model, also known as a meta-model or second-level model, which makes the final prediction. The main idea behind stacking is to combine the predictions of different base models in order to get better predictive performance than using any single model.

What is the process of Stacking?

Stacking, also known as "Stacked Generalization," is a machine learning ensemble strategy that integrates many models to improve the model's overall performance. The primary idea of stacking is to feed the predictions of numerous base models into a higher-level model known as the meta-model or blender, which then combines them to produce the final prediction.

Here’s a detailed description of how stacking works:

Preparing the Data: The first step is to prepare the data for modeling. This entails identifying
the relevant features, cleaning the data, and dividing it into training and validation sets.

Model Selection: The following step is to choose the base models that will be used in the
stacking ensemble. A broad selection of models is typically chosen to guarantee that they
produce different types of errors and complement one another.

Training the Base Models: After selecting the base models, they are trained on the training set.
To ensure diversity, each model is trained using a different algorithm or set of hyperparameters.

Predictions on the Validation Set: Once the base models have been trained, they are used to
make predictions on the validation set.
Developing a Meta Model: The next stage is to develop a meta-model, also known as a meta
learner, which will take the predictions of the underlying models as input and make the final
prediction. Any algorithm, such as linear regression, logistic regression, or even a neural
network, can be used to create this model.

Training the Meta Model: The meta-model is then trained using the predictions given by the
base models on the validation set. The base models’ predictions serve as features for the meta-
model.

Making Test Set Predictions: Finally, the meta-model is used to produce test set predictions.
The basic models’ predictions on the test set are fed into the meta-model, which then makes the
final prediction.

Model Evaluation: The final stage is to assess the stacking ensemble’s performance. This is
accomplished by comparing the stacking ensemble’s predictions to the actual values on the test
set using evaluation measures such as accuracy, precision, recall, F1 score, and so on.

In the end, the goal of stacking is to combine the strengths of various base models by feeding
them into a meta-model, which learns how to weigh and combine their forecasts to generate the
final prediction. This can frequently result in higher performance than utilizing a single model
alone.

A simple way to understand the stacking process

Step 1: Split the training dataset into two parts.

Step 2: Train several base models on the training data.

Step 3: Make predictions using the base models on the hold-out validation data.

Step 4: Train the meta-model on the hold-out validation data, using the predictions from the base models as input features.

Step 5: To make a prediction for new data, feed the new data to the base models and pass their predictions to the meta-model.

Final step: Evaluate the performance of the stacked model on a separate test dataset that was not used during training.
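
A minimal scikit-learn sketch of this process using StackingClassifier; the base models, meta-model, and synthetic dataset are illustrative choices.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Base models (first-level learners)
base_models = [
    ('rf', RandomForestClassifier(n_estimators=100, random_state=42)),
    ('svc', SVC(probability=True, random_state=42)),
]

# Meta-model (second-level learner) trained on the base models' predictions,
# which scikit-learn generates internally with cross-validation
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=LogisticRegression(max_iter=1000),
    cv=5,
)
stack.fit(X_train, y_train)
print('Stacked model accuracy:', stack.score(X_test, y_test))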
