
Unit-5: Decision Trees and Ensemble Learning
Decision Trees
• Tree-based learning algorithms are considered to be among the best and most widely used supervised learning methods.
• Tree-based methods empower predictive models with high accuracy, stability and ease of interpretation. Unlike linear models, they map non-linear relationships quite well.
• They are adaptable to solving any kind of problem at hand (classification or regression).
• Methods like decision trees, random forests and gradient boosting are popularly used in all kinds of data science problems.
• A decision tree is a type of supervised learning algorithm (having a pre-defined target variable) that is mostly used in classification problems.
• It works for both categorical and continuous input and output variables. In this technique, we split the population or sample into two or more homogeneous sets (or sub-populations) based on the most significant splitter/differentiator among the input variables.
• Let’s say we have a sample of 30 students with three variables: Gender (Boy/Girl), Class (IX/X) and Height (5 to 6 ft). 15 out of these 30 play cricket in their leisure time.
• Now, we want to create a model to predict who will play cricket during leisure time. In this problem, we need to segregate students who play cricket in their leisure time based on the most significant input variable among all three.
• This is where a decision tree helps: it will segregate the students based on all values of the three variables and identify the variable which creates the most homogeneous sets of students (which are heterogeneous to each other).
• In the snapshot below, you can see that the variable Gender is able to identify the most homogeneous sets compared to the other two variables.
• As mentioned above, a decision tree identifies the most significant variable and its value that gives the most homogeneous sets of the population. Now the question which arises is: how does it identify the variable and the split?
• The type of decision tree is based on the type of target variable we have. It can be of two types:
• Categorical Variable Decision Tree: A decision tree which has a categorical target variable is called a categorical variable decision tree. Example: in the above scenario of the student problem, the target variable was "Student will play cricket or not", i.e. YES or NO.
• Continuous Variable Decision Tree: A decision tree which has a continuous target variable is called a continuous variable decision tree.
• Example: Let’s say we have a problem to predict whether a customer will pay his renewal premium with an insurance company (yes/no). Here we know that the income of the customer is a significant variable, but the insurance company does not have income details for all customers.
• Now, since we know this is an important variable, we can build a decision tree to predict customer income based on occupation, product and various other variables. In this case, we are predicting values of a continuous variable.
• Let’s look at the basic terminology used with
Decision trees:
• Root Node: It represents entire population or
sample and this further gets divided into two
or more homogeneous sets.
• Splitting: It is a process of dividing a node into
two or more sub-nodes.
• Decision Node: When a sub-node splits into
further sub-nodes, then it is called decision
node.
• Leaf/Terminal Node: Nodes that do not split are called leaf or terminal nodes.
• Pruning: When we remove sub-nodes of a decision node, this process is called pruning. You can say it is the opposite of splitting.
• Branch/Sub-Tree: A sub-section of the entire tree is called a branch or sub-tree.
• Parent and Child Node: A node which is divided into sub-nodes is called the parent node of those sub-nodes, whereas the sub-nodes are the children of the parent node.
• These are the terms commonly used for decision trees. Since every algorithm has advantages and disadvantages, below are the important factors one should know.
Regression Trees vs Classification Trees
• We all know that the terminal nodes (or leaves) lie at the bottom of the decision tree. This means that decision trees are typically drawn upside down, such that the leaves are at the bottom and the root is at the top (as shown below).
• Regression trees are used when dependent variable
is continuous. Classification trees are used when
dependent variable is categorical.
• In the case of a regression tree, the value obtained by the terminal nodes in the training data is the mean response of the observations falling in that region. Thus, if an unseen data observation falls in that region, we'll make its prediction with the mean value.
• In the case of a classification tree, the value (class) obtained by the terminal node in the training data is the mode of the observations falling in that region. Thus, if an unseen data observation falls in that region, we'll make its prediction with the mode value.
• Both types of trees divide the predictor space (independent variables) into distinct and non-overlapping regions. For the sake of simplicity, you can think of these regions as high-dimensional boxes.
• Both types of trees follow a top-down greedy approach known as recursive binary splitting. We call it 'top-down' because it begins at the top of the tree, when all the observations are available in a single region, and successively splits the predictor space into two new branches down the tree. It is known as 'greedy' because the algorithm cares only about the current split (it looks for the best variable available) and not about future splits which could lead to a better tree.
• This splitting process is continued until a user-defined stopping criterion is reached. For example, we can tell the algorithm to stop once the number of observations per node becomes less than 50.
• In both cases, the splitting process results in fully grown trees until the stopping criterion is reached. But a fully grown tree is likely to overfit the data, leading to poor accuracy on unseen data. This brings us to 'pruning', one of the techniques used to tackle overfitting. We'll learn more about it in a following section.
How does a tree decide where to split?
• The decision to make strategic splits heavily affects a tree's accuracy. The decision criteria are different for classification and regression trees.
• Decision trees use multiple algorithms to decide how to split a node into two or more sub-nodes. The creation of sub-nodes increases the homogeneity of the resultant sub-nodes. In other words, we can say that the purity of the node increases with respect to the target variable. The decision tree splits the nodes on all available variables and then selects the split which results in the most homogeneous sub-nodes.
Gini Index
• The Gini index says that if we select two items from a population at random, they must be of the same class; the probability of this is 1 if the population is pure.
• It works with a categorical target variable ("Success" or "Failure").
• It performs only binary splits.
• The higher the value of Gini, the higher the homogeneity.
• CART (Classification and Regression Tree) uses the Gini method to create binary splits.
Steps to Calculate Gini for a split
• Calculate Gini for the sub-nodes, using the formula: sum of squares of the probabilities of success and failure (p^2 + q^2).
• Calculate Gini for the split using the weighted Gini score of each node of that split.
• Example: Referring to the example used above, where we want to segregate the students based on the target variable (playing cricket or not). In the snapshot below, we split the population using the two input variables Gender and Class. Now, we want to identify which split produces more homogeneous sub-nodes using the Gini index.
Split on Gender:
• Gini for sub-node Female = (0.2)*(0.2) + (0.8)*(0.8) = 0.68
• Gini for sub-node Male = (0.65)*(0.65) + (0.35)*(0.35) = 0.55
• Weighted Gini for the split on Gender = (10/30)*0.68 + (20/30)*0.55 = 0.59
• Similarly, for the split on Class:
• Gini for sub-node Class IX = (0.43)*(0.43) + (0.57)*(0.57) = 0.51
• Gini for sub-node Class X = (0.56)*(0.56) + (0.44)*(0.44) = 0.51
• Weighted Gini for the split on Class = (14/30)*0.51 + (16/30)*0.51 = 0.51
• Above, you can see that the Gini score for the split on Gender is higher than for the split on Class; hence, the node split will take place on Gender.
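• As a minimal sketch, the same weighted-Gini arithmetic can be reproduced in a few lines of Python (the counts and proportions are the ones quoted above):

# Gini score as used above: p^2 + q^2 for a sub-node,
# then a weighted average over the sub-nodes of a split.
def gini_node(p_play):
    q = 1.0 - p_play
    return p_play ** 2 + q ** 2

def gini_split(nodes):
    # nodes: list of (n_samples, p_play) pairs, one per sub-node
    total = sum(n for n, _ in nodes)
    return sum(n / total * gini_node(p) for n, p in nodes)

# Split on Gender: Female (10 students, 20% play), Male (20 students, 65% play)
print(round(gini_split([(10, 0.2), (20, 0.65)]), 2))   # ~0.59
# Split on Class: IX (14 students, 43% play), X (16 students, 56% play)
print(round(gini_split([(14, 0.43), (16, 0.56)]), 2))  # ~0.51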
• Chi-Square
• It is an algorithm to find out the statistical significance of the differences between sub-nodes and the parent node. We measure it by the sum of squares of the standardized differences between the observed and expected frequencies of the target variable.
• It works with a categorical target variable ("Success" or "Failure").
• It can perform two or more splits.
• The higher the value of Chi-square, the higher the statistical significance of the differences between the sub-node and the parent node.
• The Chi-square of each node is calculated using the formula:
Chi-square = sqrt((Actual - Expected)^2 / Expected)
• It generates a tree called CHAID (Chi-square Automatic Interaction Detector).
• Steps to Calculate Chi-square for a split:
• Calculate the Chi-square for each individual node by calculating the deviation for both Success and Failure.
• Calculate the Chi-square of the split using the sum of all Chi-square values of Success and Failure for each node of the split.
• Example: Let’s work with the above example that we used to calculate Gini.
• Split on Gender:
• First we populate the node Female: the actual values for "Play Cricket" and "Not Play Cricket" are 2 and 8 respectively.
• Calculate the expected values for "Play Cricket" and "Not Play Cricket": here they would be 5 for both, because the parent node has a probability of 50% and we apply the same probability to the Female count (10).
• Calculate the deviations using the formula Actual - Expected: for "Play Cricket" it is (2 - 5 = -3) and for "Not Play Cricket" it is (8 - 5 = 3).
• Calculate the Chi-square of the node for "Play Cricket" and "Not Play Cricket" using the formula sqrt((Actual - Expected)^2 / Expected). You can refer to the table below for the calculation.
• Follow similar steps to calculate the Chi-square value for the Male node.
• Now add all Chi-square values to calculate the Chi-square for the split on Gender.
• Split on Class:
• Perform similar calculations for the split on Class and you will come up with the table below.
• Above, you can see that Chi-square also identifies the Gender split as more significant compared to Class.
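• As a hedged sketch, the same Chi-square bookkeeping can be written in Python. The Male and Class counts below (13/7, 6/8 and 9/7) are derived from the percentages quoted earlier (65% of 20, 43% of 14 and 56% of 16), so treat them as illustrative:

from math import sqrt

# Per-cell value as defined above: sqrt((Actual - Expected)^2 / Expected)
def chi_cell(actual, expected):
    return sqrt((actual - expected) ** 2 / expected)

def chi_split(nodes):
    # nodes: list of (play, not_play, expected) triples, one per sub-node
    return sum(chi_cell(p, e) + chi_cell(q, e) for p, q, e in nodes)

# Split on Gender: Female (2 play / 8 not, expected 5 each), Male (13 / 7, expected 10 each)
print(round(chi_split([(2, 8, 5), (13, 7, 10)]), 2))   # ~4.58
# Split on Class: IX (6 play / 8 not, expected 7 each), X (9 / 7, expected 8 each)
print(round(chi_split([(6, 8, 7), (9, 7, 8)]), 2))     # ~1.46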
• What are the key parameters of tree modeling and
how can we avoid over-fitting in decision trees?
• Overfitting is one of the key challenges faced while modeling decision trees. If no limit is set on a decision tree, it will give you 100% accuracy on the training set because, in the worst case, it will end up making one leaf for each observation. Thus, preventing overfitting is pivotal while modeling a decision tree, and it can be done in 2 ways:
• Setting constraints on tree size
• Tree pruning
• Setting Constraints on Tree Size
• This can be done by using various parameters which are used to define a tree. First, let's look at the general structure of a decision tree:
• The parameters used for defining a tree are
further explained below. 
• Minimum samples for a node split
– Defines the minimum number of samples (or
observations) which are required in a node to be
considered for splitting.
– Used to control over-fitting. Higher values prevent
a model from learning relations which might be
highly specific to the particular sample selected
for a tree.
– Values that are too high can lead to under-fitting; hence, this parameter should be tuned using CV.
• Minimum samples for a terminal node (leaf)
– Defines the minimum samples (or observations)
required in a terminal node or leaf.
– Used to control over-fitting similar to
min_samples_split.
– Generally lower values should be chosen for
imbalanced class problems because the regions in
which the minority class will be in majority will be
very small.
• Maximum depth of tree (vertical depth)
– The maximum depth of a tree.
– Used to control over-fitting as higher depth will
allow model to learn relations very specific to a
particular sample. Should be tuned using CV.
• Maximum number of terminal nodes
– The maximum number of terminal nodes or leaves
in a tree.
– Can be defined in place of max_depth. Since binary
trees are created, a depth of ‘n’ would produce a
maximum of 2^n leaves.
• Maximum features to consider for split
– The number of features to consider while searching
for a best split. These will be randomly selected.
– As a rule of thumb, the square root of the total number of features works well, but we should check up to 30-40% of the total number of features.
– Higher values can lead to over-fitting, but this depends on the case.
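• As a hedged illustration, these constraints map directly onto parameters of scikit-learn's DecisionTreeClassifier; the specific values below are arbitrary examples, not tuned recommendations:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_redundant=2, n_classes=2, random_state=1)

dt = DecisionTreeClassifier(
    min_samples_split=50,   # minimum samples for a node split
    min_samples_leaf=20,    # minimum samples for a terminal node (leaf)
    max_depth=5,            # maximum (vertical) depth of the tree
    max_leaf_nodes=16,      # maximum number of terminal nodes
    max_features='sqrt',    # features considered for each split
    random_state=1)

print(cross_val_score(dt, X, Y, scoring='accuracy', cv=10).mean())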
• Tree Pruning
• As discussed earlier, the technique of setting constraints is a greedy approach. In other words, it will check for the best split instantaneously and move forward until one of the specified stopping conditions is reached. Let's consider the following case when you're driving:
There are 2 lanes:
• A lane with cars moving at 80km/h
• A lane with trucks moving at 30km/h
• At this instant, you are the yellow car and you have 2 choices:
• Take a left and overtake the other 2 cars quickly
• Keep moving in the present lane
• Let's analyze these choices. With the former choice, you'll immediately overtake the car ahead, reach behind the truck and start moving at 30 km/h, looking for an opportunity to move back right. All the cars originally behind you will move ahead in the meanwhile. This would be the optimum choice if your objective is to maximize the distance covered in the next, say, 10 seconds.
• This is exactly the difference between a normal decision tree and pruning. A decision tree with constraints won't see the truck ahead and will adopt a greedy approach by taking a left. On the other hand, if we use pruning, we in effect look a few steps ahead and make a choice.
• So we know pruning is better. But how do we implement it in a decision tree? The idea is simple:
• We first grow the decision tree to a large depth.
• Then we start at the bottom and remove leaves which give us negative returns when compared from the top.
• Suppose a split gives us a gain of, say, -10 (a loss of 10) and then the next split on that node gives us a gain of 20. A simple decision tree will stop at step 1, but with pruning we will see that the overall gain is +10 and keep both splits.
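• scikit-learn implements a closely related grow-then-prune idea as cost-complexity pruning (the ccp_alpha parameter): the tree is grown fully, and branches whose removal costs little accuracy are cut back. A minimal sketch, with an illustrative dataset and candidate alphas taken from the pruning path:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, Y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_redundant=2, n_classes=2, random_state=1)

# Grow the tree fully, then ask for the effective alphas of the pruning path
full_tree = DecisionTreeClassifier(random_state=1).fit(X, Y)
path = full_tree.cost_complexity_pruning_path(X, Y)

# Larger alpha = heavier pruning; evaluate a few candidates with cross-validation
for alpha in path.ccp_alphas[::10]:
    pruned = DecisionTreeClassifier(random_state=1, ccp_alpha=alpha)
    score = cross_val_score(pruned, X, Y, scoring='accuracy', cv=10).mean()
    print('ccp_alpha=%.4f  cv accuracy=%.3f' % (alpha, score))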
Binary decisions
• Let's consider an input dataset X:
X = {x_1, x_2, ..., x_N}, where each sample x_i has m features
• Every vector is made up of m features, so each of them can be a good candidate to create a node based on a (feature, threshold) tuple:
σ = <i, t_k>
• According to the feature and the threshold, the structure of the tree will change.
• For example, let's consider a class of students where
all males have dark hair and all females have blonde
hair, while both subsets have samples of different
sizes. If our task is to determine the composition of
the class, we can start with the following subdivision:
• However, the block Dark color? will contain
both males and females (which are the targets
we want to classify). This concept is expressed
using the term purity (or, more often, its
opposite concept, impurity).
• An ideal scenario is based on nodes where the
impurity is null so that all subsequent
decisions will be taken only on the remaining
features. In our example, we can simply start
from the color block:
• The two resulting sets are now pure according
to the color feature, and this can be enough
for our task. If we need further details, such as
hair length, other nodes must be added; their
impurity won't be null because we know that
there are, for example, both male and female
students with long hair.
• More formally, suppose we define the selection tuple as:
σ = <i, t_k>
• Here, the first element is the index of the feature we
want to use to split our dataset at a certain node (it
will be the entire dataset only at the beginning; after
each step, the number of samples decreases), while
the second is the threshold that determines left and
right branches.
• The choice of the best threshold is a fundamental
element because it determines the structure of the
tree and, therefore, its performance.
• The goal is to reduce the residual impurity in the
least number of splits so as to have a very short
decision path between the sample data and the
classification result.
• We can also define a total impurity measure by considering the two branches:
I(D, σ) = (N_left / N_D) * I(D_left) + (N_right / N_D) * I(D_right)
• Here, D is the whole dataset at the selected node, D_left and D_right are the resulting subsets (obtained by applying the selection tuple), the N terms are the corresponding numbers of samples, and the I terms are impurity measures.
Impurity measures
• To define the most used impurity measures, we need to consider the total number of target classes, n.
• In a certain node j, we can define the probability p(i|j), where i is an index in [1, n] associated with each class. In other words, according to a frequentist approach, this value is the ratio between the number of samples belonging to class i and the total number of samples belonging to the selected node.
• Gini impurity index
• The Gini impurity index is defined as:
I_Gini(j) = Σ_i p(i|j) * (1 - p(i|j))
• Here, the sum is always extended to all classes. This is a very common measure and it's used as the default by scikit-learn. Given a sample, the Gini impurity measures the probability of a misclassification if a label is randomly chosen using the probability distribution of the branch. The index reaches its minimum (0.0) when all the samples of a node are classified into a single category.
• Cross-entropy impurity index
• The cross-entropy measure is defined as:
I_CE(j) = -Σ_i p(i|j) * log2 p(i|j)
• This measure is based on information theory, and assumes null values only when samples belonging to a single class are present in a split, while it is maximum when there's a uniform distribution among classes (which is one of the worst cases in decision trees because it means that there are still many decision steps until the final classification).
• This index is very similar to the Gini impurity, even
though, more formally, the cross-entropy allows you
to select the split that minimizes the uncertainty
about the classification, while the Gini impurity
minimizes the probability of misclassification.
• Recall the concept of mutual information, I(X; Y) = H(X) - H(X|Y), the amount of information shared by both variables, i.e. the reduction of the uncertainty about X provided by the knowledge of Y. We can use this to define the information gain provided by a split:
IG(σ) = H(parent) - H(parent | σ)
• That is, the cross-entropy impurity of the parent node minus the weighted cross-entropy of the children produced by the split σ (the total impurity measure defined above, computed with the cross-entropy index).
• When growing a tree, we start by selecting the
split that provides the highest information gain
and proceed until one of the following conditions
is verified:
• All nodes are pure
• The information gain is null
• The maximum depth has been reached
• Misclassification impurity index
• The misclassification impurity is the simplest index, defined as:
I_m(j) = 1 - max_i p(i|j)
• In terms of quality performance, this index is not the best choice because it's not particularly sensitive to different probability distributions.
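• A small sketch computing the three impurity measures for a single node, given its class probabilities p(i|j) as a plain vector (an assumption for illustration):

import numpy as np

def gini_impurity(p):
    # I_Gini = sum_i p_i * (1 - p_i)
    p = np.asarray(p)
    return float(np.sum(p * (1.0 - p)))

def cross_entropy_impurity(p):
    # I_CE = -sum_i p_i * log2(p_i), treating 0 * log(0) as 0
    p = np.asarray(p)
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

def misclassification_impurity(p):
    # I_m = 1 - max_i p_i
    return 1.0 - float(np.max(p))

pure = [1.0, 0.0, 0.0]        # all samples in one class -> every impurity is 0
uniform = [1/3, 1/3, 1/3]     # worst case for three classes

for p in (pure, uniform):
    print(gini_impurity(p), cross_entropy_impurity(p), misclassification_impurity(p))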
• Decision tree classification with scikit-learn
• scikit-learn contains the DecisionTreeClassifier class, which
can train a binary decision tree with Gini and cross-entropy
impurity measures. In our example, let's consider a dataset
with three features and three classes:
from sklearn.datasets import make_classification

>>> nb_samples = 500
>>> X, Y = make_classification(n_samples=nb_samples, n_features=3,
        n_informative=3, n_redundant=0, n_classes=3,
        n_clusters_per_class=1)
• Let's first consider a classification with default
Gini impurity:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

>>> dt = DecisionTreeClassifier()
>>> print(cross_val_score(dt, X, Y, scoring='accuracy', cv=10).mean())
0.970
• To export a trained tree, it is necessary to use the built-in function export_graphviz():

from sklearn.tree import export_graphviz

>>> dt.fit(X, Y)
>>> with open('dt.dot', 'w') as df:
...     df = export_graphviz(dt, out_file=df,
...                          feature_names=['A', 'B', 'C'],
...                          class_names=['C1', 'C2', 'C3'])

• In this case, we have used A, B, and C as feature names and C1, C2, and C3 as class names. Once the file has been created, it's possible to convert it to PDF using the command-line tool:

<Graphviz Home>/bin/dot -Tpdf dt.dot -o dt.pdf
• As you can see, there are two kinds of nodes:
• Nonterminal, which contains the splitting tuple (as
feature <= threshold) and a positive impurity measure
• Terminal, where the impurity measure is null and a
final target class is present
• In scikit-learn, it's possible to assess the Gini importance of each feature after training a model (the output below refers to a model trained on a dataset with eight features, like the one introduced later for the ROC comparison):

>>> dt.feature_importances_
array([ 0.12066952, 0.12532507, 0.0577379 , 0.14402762, 0.14382398,
        0.12418921, 0.14638565, 0.13784106])

>>> np.argsort(dt.feature_importances_)
array([2, 0, 5, 1, 7, 4, 3, 6], dtype=int64)
• The following figure shows a plot of the importances
• The most important features are 6, 3, 4, and 7, while
feature 2, for example, separates a very small
number of samples, and can be considered
noninformative for the classification task.
• On the other hand, it's easier to decide what
the maximum number of features to consider
at each split should be. The parameter
max_features can be used for this purpose:
• If it's a number, the value is directly taken into
account at each split
• If it's 'auto' or 'sqrt', the square root of the
number of features will be adopted
• If it's 'log2', the logarithm (base 2) will be used
• If it's None, all the features will be used (this is the default value)
• Another important parameter is min_samples_split, which specifies the minimum number of samples to consider for a split. Some examples are shown in the following snippet:

>>> cross_val_score(DecisionTreeClassifier(), X, Y, scoring='accuracy', cv=10).mean()
0.77308070807080698

>>> cross_val_score(DecisionTreeClassifier(max_features='auto'), X, Y, scoring='accuracy', cv=10).mean()
0.76410071007100711

>>> cross_val_score(DecisionTreeClassifier(min_samples_split=100), X, Y, scoring='accuracy', cv=10).mean()
0.72999969996999692
• Finding the best parameters is generally a difficult task, and the best way to carry it out is to perform a grid search while including all the values that could affect the accuracy.
• Using logistic regression on the previous set (only for comparison), we get:

from sklearn.linear_model import LogisticRegression

>>> lr = LogisticRegression()
>>> cross_val_score(lr, X, Y, scoring='accuracy', cv=10).mean()
0.9053368347338937
• We can compare ROC curves for both logistic regression and decision trees, using a new dataset with eight features:

>>> nb_samples = 1000
>>> X, Y = make_classification(n_samples=nb_samples, n_features=8,
        n_informative=6, n_redundant=2, n_classes=2,
        n_clusters_per_class=4)

Ensemble learning
• When you want to purchase a new car, will you
walk up to the first car shop and purchase one
based on the advice of the dealer? It’s highly
unlikely.
• You would likely browse a few web portals where people have posted their reviews and compare different car models, checking their features and prices. You would also probably ask your friends and colleagues for their opinion. In short, you wouldn't directly reach a conclusion, but would instead make a decision considering the opinions of other people as well.
• Ensemble models in machine learning operate
on a similar idea. They combine the decisions
from multiple models to improve the overall
performance.
• Let’s understand the concept of ensemble
learning with an example. Suppose you are a
movie director and you have created a short
movie on a very important and interesting
topic. Now, you want to take preliminary
feedback (ratings) on the movie before
making it public. What are the possible ways
by which you can do that?
• A: You may ask one of your friends to rate the movie for you.
Now it’s entirely possible that the person you have chosen
loves you very much and doesn’t want to break your heart by
providing a 1-star rating to the horrible work you have
created.
• B: Another way could be by asking 5 colleagues of yours to
rate the movie.
This should provide a better idea of the movie. This method
may provide honest ratings for your movie. But a problem still
exists. These 5 people may not be “Subject Matter Experts”
on the topic of your movie. Sure, they might understand the
cinematography, the shots, or the audio, but at the same time
may not be the best judges of dark humour.
• C: How about asking 50 people to rate the movie?
Some of which can be your friends, some of them can be your
colleagues and some may even be total strangers.
• The responses, in this case, would be more
generalized and diversified since now you have
people with different sets of skills. And as it turns
out – this is a better approach to get honest
ratings than the previous cases we saw.
• With these examples, you can infer that a diverse group of people is likely to make better decisions than individuals. The same is true for a diverse set of models in comparison to single models. This diversification in Machine Learning is achieved by a technique called Ensemble Learning.
Simple Ensemble Techniques
Max Voting
• Suppose five colleagues rate the movie as follows:
Colleague 1 = 5, Colleague 2 = 4, Colleague 3 = 5, Colleague 4 = 4, Colleague 5 = 4
• The result of max voting is the most frequent rating: Final rating = 4.
• You can consider this as taking the mode of all the predictions.
Averaging
• Final rating = (5 + 4 + 5 + 4 + 4) / 5 = 4.4
Weighted Average
• Weights: 0.23, 0.23, 0.18, 0.18, 0.18
• The result is calculated as [(5*0.23) + (4*0.23) + (5*0.18) + (4*0.18) + (4*0.18)] = 4.41.
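• A quick sketch of the three combiners applied to the five ratings above:

import numpy as np
from statistics import mode

ratings = [5, 4, 5, 4, 4]
weights = [0.23, 0.23, 0.18, 0.18, 0.18]

print(mode(ratings))             # max voting (mode)  -> 4
print(np.mean(ratings))          # simple average     -> 4.4
print(np.dot(ratings, weights))  # weighted average   -> 4.41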
• Bagging
• Bootstrapping is a sampling technique in
which we create subsets of observations from
the original dataset, with replacement. The
size of the subsets is the same as the size of
the original set.
• Bagging (or Bootstrap Aggregating) technique
uses these subsets (bags) to get a fair idea of
the distribution (complete set). The size of
subsets created for bagging may be less than
the original set.
• Multiple subsets are created from the original
dataset, selecting observations with
replacement.
• A base model (weak model) is created on each
of these subsets.
• The models run in parallel and are
independent of each other.
• The final predictions are determined by
combining the predictions from all the models.
• Bagged (or Bootstrap) trees: In this case, the
ensemble is built completely. The training
process is based on a random selection of the
splits and the predictions are based on a majority
vote. Random forests are an example of bagged
tree ensembles.
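• A hedged sketch of the bagging steps above using scikit-learn's BaggingClassifier, whose default base estimator is a decision tree; the dataset and parameter values are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import cross_val_score

X, Y = make_classification(n_samples=500, n_features=8, n_informative=6,
                           n_redundant=2, n_classes=2, random_state=1)

# 50 trees, each trained on a bootstrap sample of the same size as the original set
bag = BaggingClassifier(n_estimators=50, max_samples=1.0,
                        bootstrap=True, random_state=1)

print(cross_val_score(bag, X, Y, scoring='accuracy', cv=10).mean())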
• Boosting
• Before we go further, here’s another question
for you: If a data point is incorrectly predicted by
the first model, and then the next (probably all
models), will combining the predictions provide
better results? Such situations are taken care of
by boosting.
• Boosting is a sequential process, where each
subsequent model attempts to correct the errors
of the previous model. The succeeding models
are dependent on the previous model. Let’s
understand the way boosting works in the below
steps.
• A subset is created from the original dataset.
• Initially, all data points are given equal
weights.
• A base model is created on this subset.
• This model is used to make predictions on the
whole dataset.
• Errors are calculated using the actual values
and predicted values.
• The observations which are incorrectly
predicted, are given higher weights.
(Here, the three misclassified blue-plus points
will be given higher weights)
• Another model is created and predictions are
made on the dataset.
(This model tries to correct the errors from
the previous model)
• Similarly, multiple models are created, each
correcting the errors of the previous model.
• The final model (strong learner) is the weighted mean of all the models (weak learners).
• Thus, the boosting algorithm combines a number of weak learners to form a strong learner. The individual models would not perform well on the entire dataset, but they work well for some part of the dataset. Thus, each model actually boosts the performance of the ensemble.
Algorithms based on Bagging and Boosting
• Bagging and Boosting are two of the most commonly
used techniques in machine learning.
• Bagging algorithms:
• Bagging meta-estimator
• Random forest
• Boosting algorithms:
• AdaBoost
• GBM
• XGBM
• Light GBM
• CatBoost
Random Forest
Random Forest is another ensemble machine
learning algorithm that follows the bagging
technique. It is an extension of the bagging
estimator algorithm.
The base estimators in random forest are
decision trees.
Unlike bagging meta estimator, random forest
randomly selects a set of features which are
used to decide the best split at each node of
the decision tree.
• Looking at it step-by-step, this is what a
random forest model does:
• Random subsets are created from the original
dataset (bootstrapping).
• At each node in the decision tree, only a
random set of features are considered to
decide the best split.
• A decision tree model is fitted on each of the
subsets.
• The final prediction is calculated by averaging
the predictions from all decision trees.
• To sum up, Random forest randomly selects
data points and features, and builds multiple
trees (Forest) .
from sklearn.ensemble import RandomForestClassifier

model = RandomForestClassifier(random_state=1)
model.fit(x_train, y_train)
model.score(x_test, y_test)
0.77297297297297296
• You can see feature importance by using model.feature_importances_ in a random forest:

for i, j in sorted(zip(x_train.columns, model.feature_importances_)):
    print(i, j)

The result is as below:

ApplicantIncome 0.180924483743
CoapplicantIncome 0.135979758733
Credit_History 0.186436670523
. . .
Property_Area_Urban 0.0167025290557
Self_Employed_No 0.0165385567137
Self_Employed_Yes 0.0134763695267
from sklearn.ensemble import RandomForestRegressor

model = RandomForestRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)
• Parameters
• n_estimators:
– It defines the number of decision trees to be created in a random forest.
– Generally, a higher number makes the predictions stronger and more stable, but a very large number can result in higher
training time.
• criterion:
– It defines the function that is to be used for splitting.
– The function measures the quality of a split for each feature and chooses the best split.
• max_features :
– It defines the maximum number of features allowed for the split in each decision tree.
– Increasing max_features usually improves performance, but a very high number can decrease the diversity of the individual trees.
• max_depth:
– Random forest has multiple decision trees. This parameter defines the maximum depth of the trees.
• min_samples_split:
– Used to define the minimum number of samples required in a node before a split is attempted.
– If the number of samples is less than the required number, the node is not split.
• min_samples_leaf:
– This defines the minimum number of samples required to be at a leaf node.
– Smaller leaf size makes the model more prone to capturing noise in train data.
• max_leaf_nodes:
– This parameter specifies the maximum number of leaf nodes for each tree.
– The tree stops splitting when the number of leaf nodes becomes equal to the max leaf node.
• n_jobs:
– This indicates the number of jobs to run in parallel.
– Set value to -1 if you want it to run on all cores in the system.
• random_state:
– This parameter fixes the random selection so that results are reproducible.
– It is useful for comparison between various models.
AdaBoost
• Adaptive boosting or AdaBoost is one of the
simplest boosting algorithms. Usually,
decision trees are used for modelling. Multiple
sequential models are created, each
correcting the errors from the last model.
• AdaBoost assigns weights to the observations
which are incorrectly predicted and the
subsequent model works to predict these
values correctly.
• Below are the steps for performing the AdaBoost algorithm:
1. Initially, all observations in the dataset are given equal weights.
2. A model is built on a subset of data.
3. Using this model, predictions are made on the whole dataset.
4. Errors are calculated by comparing the predictions and actual values.
5. While creating the next model, higher weights are given to the data points which were predicted incorrectly.
6. Weights can be determined using the error value. For instance, the higher the error, the more weight is assigned to the observation.
7. This process is repeated until the error function does not change, or the maximum limit of the number of estimators is reached.
from sklearn.ensemble import AdaBoostClassifier

model = AdaBoostClassifier(random_state=1)
model.fit(x_train, y_train)
model.score(x_test, y_test)
0.81081081081081086

from sklearn.ensemble import AdaBoostRegressor

model = AdaBoostRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)
• Parameters
• base_estimators:
– It helps to specify the type of base estimator, that is, the machine learning algorithm to
be used as base learner.
• n_estimators:
– It defines the number of base estimators.
– The default value is 10, but you should keep a higher value to get better performance.
• learning_rate:
– This parameter controls the contribution of the estimators in the final combination.
– There is a trade-off between learning_rate and n_estimators.
• max_depth:
– Defines the maximum depth of the individual estimator.
– Tune this parameter for best performance.
• n_jobs
– Specifies the number of processors it is allowed to use.
– Set value to -1 for maximum processors allowed.
• random_state :
– An integer value to specify the random data split.
– A definite value of random_state will always produce same results if given with same
parameters and training data.
• Gradient Boosting (GBM)
• Gradient Boosting or GBM is another ensemble
machine learning algorithm that works for both
regression and classification problems. GBM
uses the boosting technique, combining a
number of weak learners to form a strong
learner. Regression trees are used as the base learner; each subsequent tree in the series is built on the errors calculated by the previous tree.
• We will use a simple example to understand
the GBM algorithm. We have to predict the age
of a group of people using the below data:
• The mean age is assumed to be the predicted value for all observations in the dataset.
• The errors are calculated using this mean prediction and the actual values of age.
• A tree model is created using the errors calculated above as the target variable. Our objective is to find the best split to minimize the error.
• The predictions from this model are combined with the predictions from step 1.
• This value calculated above is the new prediction.
• New errors are calculated using this predicted value and the actual value.
• Steps 2 to 6 are repeated till the maximum number of iterations is reached (or the error function does not change).
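• A minimal sketch of the residual-fitting loop described above, using shallow regression trees; the learning rate and number of rounds are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=1)

learning_rate = 0.1
prediction = np.full(len(y), y.mean())    # step 1: start from the mean
trees = []

for _ in range(100):                      # steps 2-6, repeated
    residuals = y - prediction            # errors w.r.t. the current prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=1)
    tree.fit(X, residuals)                # fit a tree on the errors
    prediction += learning_rate * tree.predict(X)
    trees.append(tree)

print(np.mean((y - prediction) ** 2))     # training error shrinks at each round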
from sklearn.ensemble import GradientBoostingClassifier

model = GradientBoostingClassifier(learning_rate=0.01, random_state=1)
model.fit(x_train, y_train)
model.score(x_test, y_test)
0.81621621621621621

from sklearn.ensemble import GradientBoostingRegressor

model = GradientBoostingRegressor()
model.fit(x_train, y_train)
model.score(x_test, y_test)
• Parameters
• min_samples_split
– Defines the minimum number of samples (or observations) which are required in a node to be considered for splitting.
– Used to control over-fitting. Higher values prevent a model from learning relations which might be highly specific to the
particular sample selected for a tree.
• min_samples_leaf
– Defines the minimum samples required in a terminal or leaf node.
– Generally, lower values should be chosen for imbalanced class problems because the regions in which the minority class will be
in the majority will be very small.
• min_weight_fraction_leaf
– Similar to min_samples_leaf but defined as a fraction of the total number of observations instead of an integer.
• max_depth
– The maximum depth of a tree.
– Used to control over-fitting as higher depth will allow the model to learn relations very specific to a particular sample.
– Should be tuned using CV.
• max_leaf_nodes
– The maximum number of terminal nodes or leaves in a tree.
– Can be defined in place of max_depth. Since binary trees are created, a depth of ‘n’ would produce a maximum of 2^n leaves.
– If this is defined, GBM will ignore max_depth.
• max_features
– The number of features to consider while searching for the best split. These will be randomly selected.
– As a thumb-rule, the square root of the total number of features works great but we should check up to 30-40% of the total
number of features.
– Higher values can lead to over-fitting but it generally depends on a case to case scenario.
•   https://www.analyticsvidhya.com/blog/2018/06/comprehensive-guide-for-ensemble-models/
Clustering Fundamental
• Clustering is the task of dividing the
population or data points into a number of
groups such that data points in the same
groups are more similar to other data points
in the same group than those in other groups.
• In simple words, the aim is to segregate
groups with similar traits and assign them into
clusters.
• Let’s understand this with an example. Suppose you are the head of a rental store and wish to understand the preferences of your customers to scale up your business.
• Is it possible for you to look at the details of each customer and devise a unique business strategy for each one of them? Definitely not. But what you can do is cluster all of your customers into, say, 10 groups based on their purchasing habits and use a separate strategy for customers in each of these 10 groups. And this is what we call clustering.
• Broadly speaking, clustering can be divided into
two subgroups :
• Hard Clustering: In hard clustering, each data point
either belongs to a cluster completely or not. For
example, in the above example each customer is
put into one group out of the 10 groups.
• Soft Clustering: In soft clustering, instead of putting each data point into a separate cluster, a probability or likelihood of that data point being in those clusters is assigned. For example, in the above scenario each customer is assigned a probability of being in any of the 10 clusters of the retail store.
Types of clustering algorithms
• Since the task of clustering is subjective, the
means that can be used for achieving this goal
are plenty. Every methodology follows a
different set of rules for defining the
‘similarity’ among data points.
• In fact, there are more than 100 clustering
algorithms known. But few of the algorithms
are used popularly, let’s look at them in detail:
• Connectivity models: As the name suggests, these
models are based on the notion that the data points
closer in data space exhibit more similarity to each
other than the data points lying farther away. These
models can follow two approaches. In the first
approach, they start with classifying all data points into
separate clusters & then aggregating them as the
distance decreases. In the second approach, all data
points are classified as a single cluster and then
partitioned as the distance increases.
• Also, the choice of distance function is subjective. These models are very easy to interpret but lack scalability for handling big datasets. Examples of these models are the hierarchical clustering algorithm and its variants.
• Centroid models: These are iterative
clustering algorithms in which the notion of
similarity is derived by the closeness of a data
point to the centroid of the clusters. K-Means
clustering algorithm is a popular algorithm
that falls into this category. In these models, the number of clusters required at the end has to be specified beforehand, which makes it important to have prior knowledge of the dataset.
• These models run iteratively to find the local
optima.
• Distribution models: These clustering models are based on the notion of how probable it is that all data points in the cluster belong to the same distribution (for example, Normal or Gaussian).
• These models often suffer from overfitting.
• A popular example of these models is the Expectation-Maximization algorithm, which uses multivariate normal distributions.
• Density models: These models search the data space for areas of varying density of data points. They isolate regions of different density and assign the data points within the same region to the same cluster. Popular examples of density models are DBSCAN and OPTICS.
• Clustering basics
• Let's consider a dataset of points:
X = {x_1, x_2, ..., x_N}
• We assume that it's possible to find a criterion (not unique) so that each sample can be associated with a specific group:
G(x_i) = c_i, with c_i ∈ {1, 2, ..., k}
• Conventionally, each group is called a cluster, and the process of finding the function G is called clustering.
• we're going to discuss hard clustering
techniques, where each element must belong
to a single cluster.
• The alternative approach, called soft clustering (or fuzzy clustering), is based on a membership score that defines how much the elements are "compatible" with each cluster. The generic clustering function becomes:
F(x_i) = m_i = (m_i1, m_i2, ..., m_ik)
• The vector m_i represents the relative membership of x_i, and it's often normalized as a probability distribution.
• K Means Clustering
• K-means is an iterative clustering algorithm that converges to a locally optimal solution. The algorithm works in these 5 steps:
• Specify the desired number of clusters K : Let
us choose k=2 for these 5 data points in 2-D
space.
• Randomly assign each data point to a cluster : Let’s
assign three points in cluster 1 shown using red color
and two points in cluster 2 shown using grey color.

• Compute cluster centroids : The centroid of data


points in the red cluster is shown using red cross and
those in grey cluster using grey cross.
• Re-assign each point to the closest cluster centroid: Note that the data point at the bottom, although currently assigned to the red cluster, is closer to the centroid of the grey cluster. Thus, we re-assign that data point to the grey cluster.
• Re-compute cluster centroids : Now, re-computing
the centroids for both the clusters.
• Repeat steps 4 and 5 until no improvements are possible: Similarly, we'll repeat the 4th and 5th steps until convergence. When there is no further switching of data points between the two clusters for two successive iterations, the algorithm terminates (unless a different stopping criterion is specified explicitly).
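• A rough NumPy sketch of the assign/re-compute loop from the steps above (random initial assignment, Euclidean distances; purely illustrative, not the scikit-learn implementation):

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))               # 50 points in 2-D
k = 2
labels = rng.integers(0, k, size=len(X))   # step 2: random assignment

for _ in range(10):                        # repeat steps 3-5
    # steps 3/5: compute the centroid of each current cluster
    centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    # step 4: re-assign every point to the closest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    new_labels = distances.argmin(axis=1)
    if np.array_equal(new_labels, labels): # step 6: stop when nothing changes
        break
    labels = new_labels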
• The k-means algorithm is based on the (strong) initial condition of deciding the number of clusters through the assignment of k initial centroids or means: μ_1, μ_2, ..., μ_k.
• Let's consider a simple example with a dummy
dataset:
from sklearn.datasets import make_blobs

>>> nb_samples = 1000
>>> X, _ = make_blobs(n_samples=nb_samples, n_features=2,
        centers=3, cluster_std=1.5)
• We expect to have three clusters with bidimensional
features and a partial overlap due to the standard
deviation of each blob. In our example, we won't use
the Y variable (which contains the expected cluster)
because we want to generate only a set of locally
coherent points to try our algorithms.
• The resultant plot is shown in the following figure:
from sklearn.cluster import KMeans

>>> km = KMeans(n_clusters=3)
>>> km.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)

>>> print(km.cluster_centers_)
[[ 1.39014517,  1.38533993]
 [ 9.78473454,  6.1946332 ]
 [-5.47807472,  3.73913652]]
• Replotting the data using three different markers, it's
possible to verify how k-means successfully
separated the data:
• k-means can produce good results, but there are several
situations when the expected clustering is impossible and
letting k-means find out the centroid can lead to completely
wrong solutions.
• Let's consider the case of concentric circles. scikit-learn
provides a built-in function to generate such datasets:
from sklearn.datasets import make_circles

>>> nb_samples = 1000
>>> X, Y = make_circles(n_samples=nb_samples, noise=0.05)

• The plot of this dataset is shown in the following figure:

>>> km = KMeans(n_clusters=2)
>>> km.fit(X)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=1, precompute_distances='auto',
    random_state=None, tol=0.0001, verbose=0)
• We get the separation shown in the following figure:
• As expected, k-means converged on the two centroids in the
middle of the two half-circles, and the resulting clustering is quite
different from what we expected. Moreover, if the samples must
be considered different according to the distance from the
common center, this result will lead to completely wrong
predictions.
Finding the optimal number of clusters
• One of the most common disadvantages of k-
means is related to the choice of the optimal
number of clusters. An excessively small value
will determine large groupings that contain
heterogeneous elements, while a large number
leads to a scenario where it can be difficult to
identify the differences among clusters.
• Therefore, we're going to discuss some
methods that can be employed to determine
the appropriate number of splits and to
evaluate the corresponding performance.
• Optimizing the inertia
• The first method is based on the assumption that an appropriate number of clusters must produce a small inertia (the sum of squared distances of the samples to their closest cluster centroid). However, this value reaches its minimum (0.0) when the number of clusters is equal to the number of samples; therefore, we can't look for the minimum, but for a value which is a good trade-off between the inertia and the number of clusters.
• Let's suppose we have a dataset of 1,000 elements.
We can compute and collect the inertias (scikit-learn
stores these values in the instance variable inertia_)
for a different number of clusters:
>>> nb_clusters = [2, 3, 5, 6, 7, 8, 9, 10]
>>> inertias = []

>>> for n in nb_clusters:
...     km = KMeans(n_clusters=n)
...     km.fit(X)
...     inertias.append(km.inertia_)
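• The collected inertias can then be plotted against the number of clusters to look for the "elbow" (a short matplotlib sketch, assuming X is the blob dataset created earlier):

import matplotlib.pyplot as plt

plt.plot(nb_clusters, inertias, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.grid(True)
plt.show()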
• As you can see, there's a dramatic reduction
between 2 and 3 and then the slope starts flattening.
We want to find a value that, if reduced, leads to a
great inertial increase and, if increased, produces a
very small inertial reduction.
• Therefore, a good choice could be 4 or 5, while
greater values are likely to produce unwanted
intracluster splits (till the extreme situation where
each point becomes a single cluster).
• This method is very simple, and can be employed as
the first approach to determine a potential range.
The next strategies are more complex, and can be
used to find the final number of clusters.
• Silhouette score
• The silhouette score is based on the principle
of "maximum internal cohesion and maximum
cluster separation". In other words, we would
like to find the number of clusters that
produce a subdivision of the dataset into
dense blocks that are well separated from
each other.
• After defining a distance metric (Euclidean is normally a good choice), we can compute the average intracluster distance a(x_i) for each element.
• We can also define the average nearest-cluster distance b(x_i) (which corresponds to the lowest intercluster distance).
• The silhouette score for an element x_i is defined as:
s(x_i) = (b(x_i) - a(x_i)) / max(a(x_i), b(x_i))
• This value is bounded between -1 and 1, with
the following interpretation:
• A value close to 1 is good (1 is the best
condition) because it means that a(xi) << b(xi)
• A value close to 0 means that the difference
between intra and inter cluster measures is
almost null and therefore there's a cluster
overlap
• A value close to -1 means that the sample has
been assigned to a wrong cluster because a(xi)
>> b(xi)
from sklearn.metrics import silhouette_score

>>> nb_clusters = [2, 3, 5, 6, 7, 8, 9, 10]
>>> avg_silhouettes = []

>>> for n in nb_clusters:
...     km = KMeans(n_clusters=n)
...     Y = km.fit_predict(X)
...     avg_silhouettes.append(silhouette_score(X, Y))
• The corresponding plot is shown in the
following figure:
• The best value is 3 (its average silhouette score is very close to 1.0); however, bearing in mind the previous method, 4 clusters provide a smaller inertia, together with a reasonable silhouette score. Therefore, a good choice could be 4 instead of 3.
• However, the decision between 3 and 4 is not immediate and should also be evaluated by considering the nature of the dataset.
• The silhouette score indicates that there are 3 dense
agglomerates, but the inertia diagram suggests that one of them
(at least) can probably be split into two clusters.
• To have a better understanding of how the clustering is working,
it's also possible to graph the silhouette plots, showing the sorted
score for each sample in all clusters. In the following snippet we
create the plots for a number of clusters equal to 2, 3, 4, and 8:
import numpy as np
import matplotlib.pyplot as plt
import matplotlib.cm as cm
from sklearn.metrics import silhouette_samples

fig, ax = plt.subplots(2, 2, figsize=(15, 10))

nb_clusters = [2, 3, 4, 8]
mapping = [(0, 0), (0, 1), (1, 0), (1, 1)]

for i, n in enumerate(nb_clusters):
    km = KMeans(n_clusters=n)
    Y = km.fit_predict(X)

    silhouette_values = silhouette_samples(X, Y)
    ax[mapping[i]].set_xticks([-0.15, 0.0, 0.25, 0.5, 0.75, 1.0])
    ax[mapping[i]].set_yticks([])
    ax[mapping[i]].set_title('%d clusters' % n)
    ax[mapping[i]].set_xlim([-0.15, 1])
    ax[mapping[i]].grid()
    y_lower = 20

    for t in range(n):
        ct_values = silhouette_values[Y == t]
        ct_values.sort()
        y_upper = y_lower + ct_values.shape[0]

        color = cm.Accent(float(t) / n)
        ax[mapping[i]].fill_betweenx(np.arange(y_lower, y_upper), 0, ct_values,
                                     facecolor=color, edgecolor=color)
        y_lower = y_upper + 20
• The silhouette coefficients for each sample are computed using the function silhouette_samples() (they are always bounded between -1 and 1). In this case, we are limiting the graph between -0.15 and 1 because there are no smaller values. However, it's important to check the whole range before restricting it.
• The width of each silhouette is proportional to the number of samples belonging to a specific cluster, and its shape is determined by the scores of each sample. An ideal plot should contain homogeneous and long silhouettes without peaks (they must be similar to trapezoids rather than triangles), because we expect to have a very low score variance among samples in the same cluster.
• For 2 clusters, the shapes are acceptable, but one cluster has an average score of 0.5, while the other has a value greater than 0.75; therefore, the first cluster has low internal coherence. A completely different situation is shown in the plot corresponding to 8 clusters: all the silhouettes are triangular and their maximum score is slightly greater than 0.5. This means that all the clusters are internally coherent, but the separation is unacceptable.
• With 3 clusters, the plot is almost perfect, except for the width of the second silhouette. Without further metrics, we could consider this number as the best choice (confirmed also by the average score), but the inertia is lower for higher numbers of clusters. With 4 clusters, the plot is slightly worse, with two silhouettes having a maximum score of about 0.5. This means that two clusters are perfectly coherent and separated, while the remaining two are rather coherent but probably not well separated. Right now, our choice should be made between 3 and 4; the next methods will help us resolve the remaining doubts.
DBSCAN
• Density-Based Spatial Clustering of Applications
with Noise is a powerful algorithm that can easily
solve non-convex problems where k-means fails.
• These methods consider clusters as dense regions having some similarity, which are different from the lower-density regions of the space. These methods have good accuracy and the ability to merge two clusters.
• The idea is simple: A cluster is a high-density area
(there are no restrictions on its shape) surrounded
by a low-density one. This statement is generally
true, and doesn't need an initial declaration about
the number of expected clusters.
• The procedure starts by analyzing a small area
(formally, a point surrounded by a minimum
number of other samples). If the density is
enough, it is considered part of a cluster.
• At this point, the neighbors are taken into
account. If they also have a high density, they
are merged with the first area; otherwise,
they determine a topological separation.
• When all the areas have been scanned, the
clusters have also been determined because
they are islands surrounded by empty space.
• scikit-learn allows us to control this procedure
with two parameters:
• eps: Responsible for defining the maximum
distance between two neighbors. Higher
values will aggregate more points, while
smaller ones will create more clusters.
• min_samples: This determines how many
surrounding points are necessary to define an
area (also known as the core-point).
• Let's try DBSCAN with a very hard clustering
problem, called half-moons. The dataset can
be created using a built-in function:
from sklearn.datasets import make_moons

>>> nb_samples = 1000
>>> X, Y = make_moons(n_samples=nb_samples, noise=0.05)
• Just to understand, k-means will cluster by finding
the optimal convexity, and the result is shown in the
following figure:
• Let's try it with DBSCAN (with eps set to 0.1
and the default value of 5 for min_samples):
from sklearn.cluster import DBSCAN
>>> dbs = DBSCAN(eps=0.1)
>>> Y = dbs.fit_predict(X)
Unlike many other implementations, DBSCAN predicts the labels during the training process, so we already have an array Y containing the cluster assigned to each sample. In the following figure, there's a representation with two different markers:
As you can see, the accuracy is very high and only three isolated points are
misclassified (in this case, we know their class, so we can use this term even if it's
a clustering process). However, by performing a grid search, it's easy to find the
best values that optimize the clustering process.
• Spectral clustering
• Spectral clustering has become increasingly popular due to its simple implementation and promising performance in many graph-based clustering problems. It can be solved efficiently by standard linear algebra software, and very often outperforms traditional algorithms such as k-means.
• Spectral clustering is a more sophisticated approach based on a symmetric affinity matrix A, where each element a_ij represents a measure of affinity between two samples x_i and x_j.
• The most widely used measures (also supported by scikit-learn) are the radial basis function and nearest neighbors.
• However, any kernel can be used if it
produces measures that have the same
features of a distance (non-negative,
symmetric, and increasing).
• To perform a spectral clustering we need 3 main
steps:
• Create a similarity graph between our N objects to
cluster.
• Compute the first k eigenvectors of its Laplacian
matrix to define a feature vector for each object.
• Run k-means on these features to separate objects
into k classes.
Step 1:
• A nice way of representing a set of data points x_1, ..., x_N is in the form of a similarity graph G = (V, E).
• There are different ways to construct a graph
representing the relationships between data
points:
• ε-neighborhood graph: Each vertex is
connected to vertices falling inside a ball of
radius ε where ε is a real value that has to be
tuned in order to catch the local structure of
data.
• k-nearest neighbor graph: Each vertex is
connected to its k-nearest neighbors where k
is an integer number which controls the local
relationships of data.
• In the above example we drew 3 clusters: two "moons" and a Gaussian.
• In the ε-neighborhood graph, we can see that it is
difficult to choose a useful parameter ε. With ε = 0.3
as in the figure, the points on the middle moon are
already very tightly connected, while the points in
the Gaussian are barely connected.
• This problem always occurs if we have data “on
different scales”, that is the distances between data
points are different in different regions of the space.
However, the k-nearest neighbor graph, can connect
points “on different scales”. We can see that points
in the low-density Gaussian are connected with
points in the high-density moon.
Step 2:
• Now that we have our graph, we need to form its
associated Laplacian matrix.
• N.B: The main tools for spectral clustering are graph
Laplacian matrices.
• All we have to do now is compute the eigenvectors u_j of L.
• Step 3:
• Run k-means :
• How to choose k ?
• By projecting the points into a non-linear
embedding and analyzing the eigenvalues of
the Laplacian matrix one can deduce the
number of clusters present in the data. When
the similarity graph is not fully connected, the
multiplicity of the eigenvalue λ = 0 gives us an
estimation of k.
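• A hedged sketch of this idea: build a symmetrized k-nearest-neighbor affinity graph, form the unnormalized Laplacian L = D - A, and count the (near-)zero eigenvalues to estimate the number of clusters. The blob dataset and thresholds below are illustrative assumptions:

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=1)

# k-nearest-neighbor graph, symmetrized so the affinity matrix A is symmetric
A = kneighbors_graph(X, n_neighbors=10, mode='connectivity').toarray()
A = np.maximum(A, A.T)

# Unnormalized graph Laplacian L = D - A
L = np.diag(A.sum(axis=1)) - A

# For a graph with c connected components, the eigenvalue 0 has multiplicity c
eigenvalues = np.linalg.eigvalsh(L)
estimated_k = int(np.sum(eigenvalues < 1e-8))
print(estimated_k)   # expected to be 3 when the blobs stay disconnected in the graph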
• https://towardsdatascience.com/spectral-clustering-for-
beginners-d08b7d25b4d8
from sklearn.cluster import SpectralClustering

>>> Yss = []
>>> gammas = np.linspace(0, 12, 4)

>>> for gamma in gammas:
...     sc = SpectralClustering(n_clusters=2, affinity='rbf', gamma=gamma)
...     Yss.append(sc.fit_predict(X))

• In this algorithm, we need to specify how many clusters we want, so we set the value to 2. The resulting plots are shown in the following figure:
As you can see, when the scaling factor gamma is
increased, the separation becomes more accurate;
however, for this dataset, using the nearest
neighbors affinity makes any such search unnecessary:
>>> sc = SpectralClustering(n_clusters=2,
affinity='nearest_neighbors')
>>> Ys = sc.fit_predict(X)
The resulting plot is shown in the following figure:
• As with many other kernel-based methods, spectral clustering
needs a preliminary analysis to detect which kernel can provide the
best values for the affinity matrix.
• Evaluation methods based on the ground truth
• We present some evaluation methods that require
knowledge of the ground truth. This condition is not
always easy to satisfy because clustering is normally
applied as an unsupervised method; however, in some
cases, the training set has been manually (or automatically)
labeled, and it's useful to evaluate a model before predicting the
clusters of new samples.
• Homogeneity: we have defined the concepts of
entropy H(X) and conditional entropy H(X|Y), which
measures the uncertainty of X given the knowledge of
Y. Therefore, if the class set is denoted as C and the
cluster set as K, H(C|K) is a measure of the uncertainty
in determining the right class after having clustered the
dataset.
• To obtain a homogeneity score, it's necessary to
normalize this value considering the initial entropy of
the class set H(C):

h = 1 - H(C|K) / H(C)
• In scikit-learn, there's the built-in function
homogeneity_score() that can be used to compute this
value. For this and the next few examples, we assume
that we have a labeled dataset X (with true labels Y):
• from sklearn.cluster import KMeans
• from sklearn.metrics import homogeneity_score
• >>> km = KMeans(n_clusters=4)
• >>> Yp = km.fit_predict(X)
• >>> print(homogeneity_score(Y, Yp))
0.806560739827
• A value of 0.8 means that there's a residual uncertainty of
about 20% because one or more clusters contain some points
belonging to a secondary class.
• Completeness
• A complementary requirement is that each sample
belonging to a class is assigned to the same cluster.
This measure can be determined using the
conditional entropy H(K|C), which is the uncertainty
in determining the right cluster given the knowledge
of the class. As with the homogeneity score, we need
to normalize this using the entropy H(K):

c = 1 - H(K|C) / H(K)
• We can compute this score (on the same dataset)
using the function completeness_score():
• from sklearn.metrics import completeness_score
• >>> km = KMeans(n_clusters=4)
• >>> Yp = km.fit_predict(X)
• >>> print(completeness_score(Y, Yp))
0.807166746307
• Also, in this case, the value is rather high, meaning
that the majority of samples belonging to a class
have been assigned to the same cluster. This value
can be improved using a different number of clusters
or changing the algorithm.
• Adjusted rand index
• The adjusted rand index measures the similarity
between the original class partitioning (Y) and the
clustering. Bearing in mind the same notation adopted
in the previous scores, we can define:
• a: The number of pairs of elements belonging to the
same partition in the class set C and to the same
partition in the clustering set K
• b: The number of pairs of elements belonging to
different partitions in the class set C and to different
partitions in the clustering set K
• If the total number of samples in the dataset is n, the
Rand index is defined as:

R = (a + b) / C(n, 2)

where C(n, 2) = n(n - 1)/2 is the number of possible sample pairs.
• The corrected-for-chance version is the adjusted Rand index,
defined as follows:

ARI = (R - E[R]) / (max(R) - E[R])

• We can compute the adjusted Rand score using the function
adjusted_rand_score():
from sklearn.metrics import adjusted_rand_score
>>> km = KMeans(n_clusters=4)
>>> Yp = km.fit_predict(X)
>>> print(adjusted_rand_score(Y, Yp))
0.831103137285
• As the adjusted rand score is bounded
between -1.0 and 1.0, with negative values
representing a bad situation (the assignments
are strongly uncorrelated), a score of 0.83
means that the clustering is quite similar to
the ground truth. Also, in this case, it's
possible to optimize this value by trying
different numbers of clusters or clustering
strategies.
• Meta Classifiers (covered in the Decision Tree topics):
Bagging, Boosting, Random Forest
• What are ensemble methods in tree based modeling ?
• The literal meaning of the word ‘ensemble’ is group. Ensemble
methods involve a group of predictive models to achieve
better accuracy and model stability. Ensemble methods are
known to give a supreme boost to tree-based models.
• Like every other model, a tree-based model also suffers
from the plague of bias and variance. Bias means ‘how
much, on average, the predicted values differ from
the actual values.’ Variance means ‘how different the
predictions of the model at the same point would be if different
samples were taken from the same population.’
• If you build a small tree, you will get a model with
low variance and high bias. How do you manage to
balance the trade-off between bias and variance?
• Normally, as you increase the complexity of your
model, you will see a reduction in prediction error
due to lower bias in the model. As you continue to
make your model more complex, you end up over-
fitting your model and your model will start suffering
from high variance.
• A champion model should maintain a balance
between these two types of errors. This is known as
the trade-off management of bias-variance
errors. Ensemble learning is one way to execute this
trade off analysis.
• What is Bagging? How does it work?
• Bagging is a technique used to reduce the variance of
our predictions by combining the result of
multiple classifiers modeled on different sub-samples
of the same data set. The following figure will make
it clearer:
• The steps followed in bagging are:
• Create Multiple DataSets:
– Sampling is done with replacement on the original
data and new datasets are formed.
– The new data sets can have a fraction of the columns
as well as rows, which are generally hyper-
parameters in a bagging model
– Taking row and column fractions less than 1 helps in
making robust models, less prone to overfitting
• Build Multiple Classifiers:
– Classifiers are built on each data set.
– Generally the same classifier is modeled on each
data set and predictions are made.
• Combine Classifiers:
– The predictions of all the classifiers are combined
using a mean, median or mode value depending
on the problem at hand.
– The combined values are generally more robust
than a single model.
• Note that the number of models built here is not a
hyper-parameter: a higher number of models is
always better, or may give performance similar to a
lower number. It can be theoretically shown that,
under some assumptions, the variance of the
combined predictions is reduced to 1/n (n: number of
classifiers) of the original variance.
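Here is a minimal, hedged sketch of bagging with scikit-learn's BaggingClassifier on a hypothetical labeled dataset (X, Y); the row and column fractions discussed above correspond to max_samples and max_features:

from sklearn.ensemble import BaggingClassifier
from sklearn.datasets import make_classification

# Hypothetical labeled dataset
X, Y = make_classification(n_samples=500, n_features=20, random_state=1)

# By default, BaggingClassifier bags decision trees; each one is trained on a
# bootstrap sample of 80% of the rows and a random 80% of the columns.
bag = BaggingClassifier(n_estimators=50,
                        max_samples=0.8,
                        max_features=0.8,
                        bootstrap=True,
                        random_state=1)
bag.fit(X, Y)
print(bag.score(X, Y))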
• There are various implementations of bagging models. Random
forest is one of them and we’ll discuss it next.
• Random Forest
• Assume the number of cases in the training set is N. A sample
of these N cases is taken at random, but with replacement. This
sample will be the training set for growing the tree.
• If there are M input variables, a number m<M is specified such
that at each node, m variables are selected at random out of
the M. The best split on these m is used to split the node. The
value of m is held constant while we grow the forest.
• Each tree is grown to the largest extent possible and  there is
no pruning.
• Predict new data by aggregating the predictions of the ntree
trees (i.e., majority votes for classification, average for
regression).
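A minimal sketch of growing such a forest with scikit-learn (hypothetical dataset; max_features plays the role of m and n_estimators is the number of trees):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

# Hypothetical labeled dataset
X, Y = make_classification(n_samples=500, n_features=20, random_state=1)

rf = RandomForestClassifier(n_estimators=100,
                            max_features='sqrt',   # m = sqrt(M) candidate variables per split
                            random_state=1)
rf.fit(X, Y)

print(rf.predict(X[:5]))            # aggregated (majority-vote) predictions
print(rf.feature_importances_[:5])  # per-variable importance, discussed below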
• Advantages of Random Forest
• This algorithm can solve both types of problems, i.e.,
classification and regression, and does a decent
estimation on both fronts.
• One of the most exciting benefits of random forest
is its power to handle large data sets with
higher dimensionality.
• It can handle thousands of input variables and
identify the most significant ones, so it is considered
one of the dimensionality reduction methods.
Further, the model outputs the importance of each
variable, which can be a very handy feature (on
some random data set).
• It has an effective method for estimating missing data and
maintains accuracy when a large proportion of the data are
missing.
• It has methods for balancing errors in data sets where classes are
imbalanced.
• The capabilities of the above can be extended to unlabeled data,
leading to unsupervised clustering, data views and outlier
detection.
• Random Forest involves sampling of the input data with
replacement, called bootstrap sampling. About one third of the
data is not used for training and can be used for testing; these are
called the out-of-bag samples. The error estimated on these out-of-
bag samples is known as the out-of-bag error. Studies of out-of-bag
error estimates give evidence that the out-of-bag estimate is as
accurate as using a test set of the same size as the training set.
Therefore, using the out-of-bag error estimate removes the need
for a set-aside test set.
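A minimal sketch of obtaining this estimate with scikit-learn (same hypothetical dataset as in the earlier sketches):

from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, Y = make_classification(n_samples=500, n_features=20, random_state=1)  # hypothetical data

# With oob_score=True, each tree is evaluated on the ~1/3 of samples
# left out of its bootstrap sample.
rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=1)
rf.fit(X, Y)
print(rf.oob_score_)   # out-of-bag accuracy; OOB error = 1 - oob_score_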
• Disadvantages of Random Forest
• It does a good job at classification but is not as good
for regression problems, since it does not give precise
continuous predictions. In the case of regression, it
doesn't predict beyond the range seen in the training
data, and it may over-fit data sets that are particularly
noisy.
• Random Forest can feel like a black box
approach for statistical modelers – you have
very little control on what the model does. You
can at best – try different parameters and
random seeds!
• What is Boosting? How does it work?
• Boosting refers to a family of algorithms that convert weak learners into
strong learners. Let's understand this definition in detail by solving a problem of
spam email identification:
• How would you classify an email as SPAM or not? Like everyone
else, our initial approach would be to identify ‘spam’ and ‘not
spam’ emails using following criteria. If:
• Email has only one image file (promotional image), It’s a SPAM
• Email has only link(s), It’s a SPAM
• Email body consists of sentences like “You won a prize money of $
xxxxxx”, It's a SPAM
• Email from our official domain “Analyticsvidhya.com” , Not a
SPAM
• Email from known source, Not a SPAM
• Above, we’ve defined multiple rules to classify an email into
‘spam’ or ‘not spam’. But, do you think these rules individually
are strong enough to successfully classify an email? No.
• Individually, these rules are not powerful enough to
classify an email into ‘spam’ or ‘not spam’. Therefore,
these rules are called weak learners.
• To convert weak learners into a strong learner, we
combine the prediction of each weak learner using
methods like:
• Using an average/weighted average
• Considering the prediction with the higher vote
• For example: Above, we have defined 5 weak
learners. Out of these 5, 3 vote ‘SPAM’ and 2
vote ‘Not a SPAM’. In this case, by default,
we'll consider the email as SPAM because we have
the higher vote (3) for ‘SPAM’.
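A minimal sketch of that majority-vote rule, using hypothetical 0/1 votes from the five weak rules above:

import numpy as np

votes = np.array([1, 1, 1, 0, 0])           # 1 = SPAM, 0 = Not a SPAM
is_spam = votes.sum() > len(votes) / 2      # simple majority vote
print('SPAM' if is_spam else 'Not a SPAM')  # 3 of 5 votes -> SPAM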
• Now we know that boosting combines weak
learners, a.k.a. base learners, to form a strong
rule. An immediate question that should
pop into your mind is, ‘How does boosting identify
weak rules?’
• To find weak rules, we apply base learning (ML)
algorithms with different distributions. Each
time a base learning algorithm is applied, it
generates a new weak prediction rule. This is
an iterative process. After many iterations, the
boosting algorithm combines these weak rules
into a single strong prediction rule.
• To choose the right distribution, here are the
steps:
• Step 1: The base learner takes all the
distributions and assigns equal weight or attention
to each observation.
• Step 2: If there is any prediction error caused by the
first base learning algorithm, then we pay higher
attention to the observations having prediction errors.
Then, we apply the next base learning algorithm.
• Step 3: Iterate Step 2 until the limit of the base learning
algorithm is reached or higher accuracy is
achieved.
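As one concrete example of this iterative re-weighting, here is a hedged sketch using scikit-learn's AdaBoostClassifier on a hypothetical dataset (AdaBoost is one boosting algorithm, not necessarily the only one implied by the text):

from sklearn.ensemble import AdaBoostClassifier
from sklearn.datasets import make_classification

X, Y = make_classification(n_samples=500, n_features=20, random_state=1)  # hypothetical data

# The default weak learner is a depth-1 decision tree; misclassified samples
# receive higher weight before the next weak learner is fitted.
ada = AdaBoostClassifier(n_estimators=100, random_state=1)
ada.fit(X, Y)
print(ada.score(X, Y))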
Reference
• https://www.geeksforgeeks.org/naive-bayes-classifiers/
• https://www.cs.utah.edu/~piyush/teaching/15-9-slides.pdf
• https://blog.statsbot.co/ensemble-learning-d1dcd548e936
• https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
