Download as pdf or txt
Download as pdf or txt
You are on page 1of 42

4.1, 4.2, 4.

3 - COMBINING MULTIPLE LEARNERS

Generally, we develop machine learning models to solve a problem by using the given input data.
When we work on a single algorithm, we are unable to distinguish the performance of the model
for that particular statement, as there is nothing to compare it against. So, we feed the input data
to other machine learning algorithms and then compare them with each other to know which is
the best algorithm that suits the given problem. Every algorithm has its own mathematical
computation and significance to deal with a specific problem to bring out the best results at the
end.

Why do we combine models?

While dealing with a specific problem with a machine learning algorithm we sometimes fail,
because of the poor performance of the model. The algorithm that we have used may be well
suited to the model, but we still fail in getting better outcomes at the end. In this situation, we
might have many questions in our mind. How can we bring out better results from the model?
What are the steps to be taken further in the model development? What are the hidden techniques
that can help to develop an efficient model?

To overcome this situation there is a procedure called “Combining Models”, where we mix one
or two weaker machine learning models to solve a problem and get better outcomes. In machine
learning, the combining of models is done by using two approaches namely “Ensemble Models”
& “Hybrid Models”.

Ensemble Models use multiple machine learning algorithms to bring out better predictive results,
as compared to using a single algorithm. There are different approaches in Ensemble models to
perform a particular task. There is another model called Hybrid model that is flexible and helps
to create a more innovative model than an Ensemble model. While combining models we need to
check how strong or weak a particular machine learning model is, to deal with a particular
problem.

Ensemble learning combines multiple machine learning models into a single model. The aim is to
increase the performance of the model. Bagging aims to decrease variance, boosting aims to decrease
bias, and stacking aims to improve prediction accuracy. Bagging and boosting combine homogenous
weak learners.
Ensemble learning is one of the most powerful machine learning techniques that use the combined
output of two or more models/weak learners and solve a particular computational intelligence
problem. E.g., a Random Forest algorithm is an ensemble of various decision trees combined.

Ensemble learning is primarily used to improve the model performance, such as classification,
prediction, function approximation, etc. When designing a learning machine, we generally make
some choices like parameters of machine, training data, representation, etc. This implies some sort
of variance in performance. For example, in a classification setting, we can use a parametric
classifier or in a multilayer perceptron, we should also decide on the number of hiddenunits.

Each learning algorithm dictates a certain model that comes with a set of assumptions. This
inductive bias leads to error if the assumptions do not hold for the data.

Different learning algorithms have different accuracies. The no free lunch theorem asserts that no
single learning algorithm always achieves the best performance in any domain. They can be
combined to attain higher accuracy

• Data fusion is the process of fusing multiple records representing the same real-world object
into a single, consistent, and clean representation Fusion of data for improving prediction
accuracy and reliability is an important problem in machine learning.

• Combining different models is done to improve the performance of deep learning models.
Building a new model by combination requires less time, data, and computational resources. The
most common method to combine models is by averaging multiple models, where taking a
weighted average improves the accuracy.

Generating Diverse Learners:


Different Algorithms: We can use different learning algorithms to train different base-learners.
Different algorithms make different assumptions about the data and lead to different classifiers.

Different Hyper-parameters: We can use the same learning algorithm but use it with different
hyper-parameters.

Different Input Representations: Different representations make different characteristics


explicit allowing better identification.

Different Training Sets: Another possibility is to train different base-learners by different subsetsof
the training set.

1.1Model Combination Schemes


Different methods are used for generating final output for multiple base learners are Multiexpert
and multistage combination.
1. Multiexpert combination
Multiexpert combination methods have base-learners that work in parallel.
a) Global approach (learner fusion) given an input, all base-learners generate an output and all
these outputs are used, such as voting and stacking

b) Local approach (leamer selection) in mixture of experts, there is a gating model, which looks
at the input and chooses one (or very few) of the learners as responsible for generating the output.

2.Multistage combination

Multistage combination methods use a serial approach where the next multistage combination base-
learner is trained with or tested on only the instances where the previous base-learners are not
accurate enough.

 Let's assume that we want to construct a function that maps inputs to outputs from a set of
known Ntraininput-output pairs.

where x, Є X is a D dimensional feature input vector. yiЄ Y is the output.

Classification: When the output takes values in a discrete set of class labels Y = {c1;c2;…cx},
where K is the number of different classes. Regression consists in predicting continuous ordered
outputs, Y=R

1.2 Voting

 The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners. Voting is an ensemble machine learning algorithm.

• For regression, a voting ensemble involves making a prediction that is the average of multiple
other regression models.

A Voting Classifier is a machine learning model that trains on an ensemble of numerous models
and predicts an output (class) based on their highest probability of chosen class as the output.

• It simply aggregates the findings of each classifier passed into Voting Classifier and
predicts the output class based on the highest majority of voting.

• The idea is instead of creating separate dedicated models and finding the accuracy for each
them, we create a single model which trains by these models and predicts output based on
their combined majority of voting for each output class.
Voting Classifier supports two types of votings.

Hard Voting: In hard voting, the predicted output class is a class with the highest majority of
votes i.e the class which had the highest probability of being predicted by each of the classifiers.
Suppose three classifiers predicted the output class(A, A, B), so here the majority predicted A as
output. Hence A will be the final prediction.

Soft Voting: In soft voting, the output class is the prediction based on the average of probability
given to that class.

• Suppose given some input to three models, the prediction probability for class A = (0.30,
0.47, 0.53) and B = (0.20, 0.32, 0.40). So the average for class A is 0.4333 and B is 0.3067,
the winner is clearly class A because it had the highest probability averaged by each
classifier.

In this methods, the first step is to create multiple classification/regression models using some
training dataset. Each base model can be created using different splits of the same training dataset
and same algorithm, or using the same dataset with different algorithms, or any other method.

 Learn multiple alternative definitions of a concept using different training data or different
learning algorithms. It combines decisions of multiple definitions,

Eg. using weighted voting


Figure shows general idea of Base-learners with model combiner
When combining multiple independent and diverse decisions each of which is at least more
accurate than random guessing, random errors cancel each other out, and correct decisions are
reinforced. Human ensembles are demonstrably better.
4.4, 4.5, 4.6, 4.7 - ENSEMBLE LEARNING

Introduction To Ensemble Learning

Ensemble Learning Techniques in Machine Learning, Machine learning models suffer bias
and/or variance. Bias is the difference between the predicted value and actual value by the
model. Bias is introduced when the model doesn’t consider the variation of data and creates a
simple model. The simple model doesn’t follow the patterns of data, and hence the model gives
errors in predicting training as well as testing data i.e. the model with high bias and high
variance. When the model follows even random quirks of data, as pattern of data, then the
model might do very well on training dataset i.e. it gives low bias, but it fails on test data and
gives high variance.

Therefore, to improve the accuracy (estimate) of the model, ensemble learning methods are
developed. Ensemble is a machine learning concept, in which several models are trained using
machine learning algorithms. It combines low performing classifiers (also called as weak
learners or base learner) and combine individual model prediction for the final prediction.

On the basis of type of base learners, ensemble methods can be categorized as homogeneous
and heterogeneous ensemble methods. If base learners are same, then it is a homogeneous
ensemble method. If base learners are different then it is a heterogeneous ensemble method.

Ensemble Learning Methods

Ensemble techniques are classified into three types:


1. Bagging

2. Boosting

3. Stacking

High-bias and High-variance Models

 A high-bias model results from not learning data well enough. It is not related to
the distribution of the data. Hence future predictions will be unrelated to the data
and thus incorrect.

 A high variance model results from learning the data too well. It varies with each
data point. Hence it is impossible to predict the next point accurately.

Both high bias and high variance models thus cannot generalize properly. Thus, weak
learners will either make incorrect generalizations or fail to generalize altogether. Because
of this, the predictions of weak learners cannot be relied on by themselves.

As we know from the bias-variance trade-off, an underfit model has high bias and low
variance, whereas an overfit model has high variance and low bias. In either case, there is
no balance between bias and variance. For there to be a balance, both the bias and variance
need to be low. Ensemble learning tries to balance this bias-variance trade-off by reducing
either the bias or the variance.

It aims to reduce the bias if we have a weak model with high bias and low
variance. Ensemble learning will aim to reduce the variance if we have a weak model with
high variance and low bias. This way, the resulting model will be much more balanced,
with low bias and variance. Thus, the resulting model will be known as a strong
learner. This model will be more generalized than the weak learners. It will thus be able
to make accurate predictions.

Monitoring Ensemble Learning Models

Ensemble learning improves a model’s performance in mainly three ways:

 By reducing the variance of weak learners

 By reducing the bias of weak learners,

 By improving the overall accuracy of strong learners.

Bagging is used to reduce the variance of weak learners. Boosting is used to reduce the
bias of weak learners. Stacking is used to improve the overall accuracy of strong learners.

Reducing Variance with Bagging

We use bagging for combining weak learners of high variance. Bagging aims to produce
a model with lower variance than the individual weak models. These weak learners are
homogenous, meaning they are of the same type.
Bagging is also known as Bootstrap aggregating. It consists of two steps: bootstrapping
and aggregation.

Bootstrapping

Involves resampling subsets of data with replacement from an initial dataset. In other
words, subsets of data are taken from the initial dataset. These subsets of data are called
bootstrapped datasets or, simply, bootstraps. Resampled ‘with replacement’ means an
individual data point can be sampled multiple times. Each bootstrap dataset is used to train
a weak learner.

Aggregating

The individual weak learners are trained independently from each other. Each learner
makes independent predictions. The results of those predictions are aggregated at the end
to get the overall prediction. The predictions are aggregated using either max voting or
averaging.

Max Voting

It is a commonly used for classification problems that consists of taking the mode of the
predictions (the most occurring prediction). It is called voting because like in election
voting, the premise is that ‘the majority rules’. Each model makes a prediction. A
prediction from each model counts as a single ‘vote’. The most occurring ‘vote’ is chosen
as the representative for the combined model.

Averaging

It is generally used for regression problems. It involves taking the average of the
predictions. The resulting average is used as the overall prediction for the combined model.
Steps of Bagging

The steps of bagging are as follows:

1. We have an initial training dataset containing n-number of instances.

2. We create a m-number of subsets of data from the training set. We take a subset
of N sample points from the initial dataset for each subset. Each subset is taken
with replacement. This means that a specific data point can be sampled more than
once.

3. For each subset of data, we train the corresponding weak learners independently.
These models are homogeneous, meaning that they are of the same type.

4. Each model makes a prediction.

5. The predictions are aggregated into a single prediction. For this, either max voting
or averaging is used.
Reducing Bias by Boosting

We use boosting for combining weak learners with high bias. Boosting aims to produce a
model with a lower bias than that of the individual models. Like in bagging, the weak
learners are homogeneous.

Boosting involves sequentially training weak learners. Here, each subsequent learner
improves the errors of previous learners in the sequence. A sample of data is first taken
from the initial dataset. This sample is used to train the first model, and the model makes
its prediction. The samples can either be correctly or incorrectly predicted. The samples
that are wrongly predicted are reused for training the next model. In this way, subsequent
models can improve on the errors of previous models.

Unlike bagging, which aggregates prediction results at the end, boosting aggregates the
results at each step. They are aggregated using weighted averaging.

Weighted averaging involves giving all models different weights depending on their
predictive power. In other words, it gives more weight to the model with the highest
predictive power. This is because the learner with the highest predictive power is
considered the most important.

Steps of Boosting
Boosting works with the following steps:

1. We sample m-number of subsets from an initial training dataset.

2. Using the first subset, we train the first weak learner.

3. We test the trained weak learner using the training data. As a result of the testing,
some data points will be incorrectly predicted.

4. Each data point with the wrong prediction is sent into the second subset of data,
and this subset is updated.

5. Using this updated subset, we train and test the second weak learner.

6. We continue with the following subset until the total number of subsets is reached.

7. We now have the total prediction. The overall prediction has already been
aggregated at each step, so there is no need to calculate it.

Improving Model Accuracy with Stacking

We use stacking to improve the prediction accuracy of strong learners. Stacking aims to
create a single robust model from multiple heterogeneous strong learners.

Stacking differs from bagging and boosting in that:

 It combines strong learners

 It combines heterogeneous models

 It consists of creating a Metamodel. A metamodel is a model created using a new


dataset.

Individual heterogeneous models are trained using an initial dataset. These models make
predictions and form a single new dataset using those predictions. This new data set is
used to train the metamodel, which makes the final prediction. The prediction is combined
using weighted averaging.

Because stacking combines strong learners, it can combine bagged or boosted models.
Steps of Stacking

The steps of Stacking are as follows:

1. We use initial training data to train m-number of algorithms.

2. Using the output of each algorithm, we create a new training set.

3. Using the new training set, we create a meta-model algorithm.

4. Using the results of the meta-model, we make the final prediction. The results are
combined using weighted averaging.

When to use Bagging vs Boosting vs Stacking?

If you want to reduce the overfitting or variance of your model, you use bagging and if
you are looking to reduce underfitting or bias, you use boosting. However, if you want to
increase predictive accuracy, use stacking.
Bagging and boosting both works with homogeneous weak learners. Stacking works using
heterogeneous solid learners.

All three of these methods can work with either classification or regression problems.

One disadvantage of boosting is that it is prone to variance or overfitting. It is thus not


advisable to use boosting for reducing variance. Boosting will do a worse job in reducing
variance as compared to bagging.

On the other hand, the converse is true. It is not advisable to use bagging to reduce bias or
underfitting. This is because bagging is more prone to bias and does not help reduce bias.

Stacked models have the advantage of better prediction accuracy than bagging or boosting.
But because they combine bagged or boosted models, they have the disadvantage of
needing much more time and computational power. If you are looking for faster results,
it’s advisable not to use stacking. However, stacking is the way to go if you’re looking for
high accuracy.

Conclusion

Bagging, boosting, and stacking are vital techniques in ensemble learning in machine
learning. They play a crucial role in enhancing model accuracy and mitigating the risks
associated with inaccurate predictions.

Ensemble learning combines multiple machine learning models into a single model. The
aim is to increase the performance of the model.

 Bagging aims to decrease variance, boosting aims to decrease bias, and stacking
aims to improve prediction accuracy.

 Bagging and boosting combine homogenous weak learners. Stacking combines


heterogeneous solid learners.

 Bagging trains models in parallel and boosting trains the models sequentially.
Stacking creates a meta-model.
4.8 UNSUPERVISED LEARNING

Unsupervised Machine Learning

There may be many cases in which we do not have labeled data and need to find the hidden
patterns from the given dataset. So, to solve such types of cases in machine learning, we need
unsupervised learning techniques.

What is Unsupervised Learning?

As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using training dataset. Instead, models itself find the hidden patterns and
insights from the given data. It can be compared to learning which takes place in the human
brain while learning new things. It can be defined as:

Unsupervised learning is a type of machine learning in which models are trained using
unlabeled dataset and are allowed to act on that data without any supervision.

Unsupervised learning cannot be directly applied to a regression or classification problem


because unlike supervised learning, we have the input data but no corresponding output data.
The goal of unsupervised learning is to find the underlying structure of dataset, group that
data according to similarities, and represent that dataset in a compressed format.

Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task of the
unsupervised learning algorithm is to identify the image features on their own. Unsupervised
learning algorithm will perform this task by clustering the image dataset into the groups
according to similarities between images.
Why use Unsupervised Learning?

Below are some main reasons which describe the importance of Unsupervised Learning:

o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much similar as a human learns to think by their own
experiences, which makes it closer to the real AI.
o Unsupervised learning works on unlabeled and uncategorized data which make
unsupervised learning more important.
o In real-world, we do not always have input data with the corresponding output so to
solve such cases, we need unsupervised learning.

Working of Unsupervised Learning

Working of unsupervised learning can be understood by the below diagram:


Here, we have taken an unlabeled input data, which means it is not categorized and
corresponding outputs are also not given. Now, this unlabeled input data is fed to the machine
learning model in order to train it. Firstly, it will interpret the raw data to find the hidden
patterns from the data and then will apply suitable algorithms such as k-means clustering,
Decision tree, etc.

Once it applies the suitable algorithm, the algorithm divides the data objects into groups
according to the similarities and difference between the objects.

Types of Unsupervised Learning Algorithm:

The unsupervised learning algorithm can be further categorized into two types of
problems:

o Clustering: Clustering is a method of grouping the objects into clusters such that
objects with most similarities remains into a group and has less or no similarities with
the objects of another group. Cluster analysis finds the commonalities between the data
objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for
finding the relationships between variables in the large database. It determines the set
of items that occurs together in the dataset. Association rule makes marketing strategy
more effective. Such as people who buy X item (suppose a bread) are also tend to
purchase Y (Butter/Jam) item. A typical example of Association rule is Market Basket
Analysis.
Unsupervised Learning algorithms:

o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchal clustering
o Anomaly detection
o Neural Networks
o Principle Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition

Advantages of Unsupervised Learning

o Unsupervised learning is used for more complex tasks as compared to supervised


learning because, in unsupervised learning, we don't have labeled input data.
o Unsupervised learning is preferable as it is easy to get unlabeled data in comparison to
labeled data.

Disadvantages of Unsupervised Learning

o Unsupervised learning is intrinsically more difficult than supervised learning as it does


not have corresponding output.

o The result of the unsupervised learning algorithm might be less accurate as input data
is not labeled, and algorithms do not know the exact output in advance.
4.9 K-MEANS

K-means clustering is a type of unsupervised learning when we have unlabeled data


(i.e., data without defined categories or groups). Clustering refers to a collection of
data points based on specific similarities.

K-Means Algorithm

K-means aims to find groups in the data, with the number of groups represented by
the variable K. Based on the provided features, the algorithm works iteratively to
assign each data point to one of the “K” groups. Please note that the clustering of the
data points is by the similarity of features.

K-Means uses of Centroid Model algorithm. Centroid Models use the notion of data
points that are close to the center of clusters having a relation.

A target number referred to as ‘k’ is the number of centroids you need in the dataset.
A centroid is the cluster’s center, and each data point is assigned to a specific cluster.

The ‘means’ in the name refers to the averaging of the data, which helps to find the
centroid location.
The K-means algorithm aims to reduce the distance between the data points and their
cluster centroid.

What is a Centroid?

Centroid is a method of clustering that represents each cluster by a single mean


vector, known as the centroid. The centroid is the average of all points belonging to
the cluster. The K-means algorithm assigns data points to the cluster whose centroid
is nearest to the point. Throughout the algorithm’s iterative process, these centroids
are recalculated as the mean value of all points assigned to the cluster. Thereby
refining the cluster assignment with each iteration until the process converges or
meets a defined stopping criterion.

How Does the K-Means Algorithm Work?

K-means clustering works in the following steps:

1. Choosing the number of clusters (k) : This is often done using methods like
the elbow or silhouette method. This helps determine the optimal number of
clusters by evaluating the within-cluster sum of squares (WCSS) or silhouette
score.
2. Randomly selecting initial centroids: While centroids can be randomly
selected, some implementations of K-means use more sophisticated methods
for initialization, such as the k-means++ algorithm, which aims to spread out
the initial centroids before proceeding with the standard K-means optimization
iterations.
3. Assigning data points to the nearest centroid: The algorithm assigns each
data point to the cluster with the closest centroid, typically using the Euclidean
distance metric.
4. Optimizing centroids: After the assignment, the centroids are recalculated as
the mean of all points in their cluster, which may or may not result in the
centroid moving.
5. Iterative process: The assignment and update steps (3 and 4) are repeated
until a stopping criterion is met. This could be a fixed number of iterations, a
threshold of minimal changes in centroid positions, minimal changes in the
assignment of the data points to the clusters, or when there is no further
decrease in WCSS.

Please note that the quality of the cluster assignments is determined by the within -
cluster sum of squares (WCSS), which t he algorithm tries to minimize.

One of the challenges in K-means is choosing the right K, i.e., the number of clusters.
This is often done using methods like the elbow method. In the elbow method, the
WCSS is plotted against the number of clusters, and the “elbow point” of the curve
is chosen as the optimal K.

Stop point for K-Means clustering:

Common stopping criteria for K-means include:

 No change in the centroids (i.e., the posit ions of the centroids do not change
between iterations)
 No change in cluster assignments (i.e., data points remain in the same cluster)
 A predetermined number of iterations is reached
 The decrease in the within-cluster sum of squares (WCSS) between iterations
is below a certain threshold.

When Can We Manually Stop the Process?

 If the centroids of newly formed clusters are not changing, we can interpret
that the algorithm is not learning any new characteristics or patterns; acting as
a sign to stop the training.
 If we have trained the algorithm for multiple iterations, however, the data
points remain the same.
Choosing the Correct Number of Clusters, k.

Regarding my simple example above, I chose 2 clusters purely for the article.
However, you might be wondering how to choose the right number of clusters.

There are two methods that you can use to select the correct value of k:

1. Elbow Method
2. Silhouette Method

Elbow Method

The Elbow Method is a popular tool that applies k-means for a range of values of k,
and we plot it against the Sum of Squared Error, referred to as a scree plot.

The image below shows that the total Sum of Squared Error decreases as we increase
the number of Clusters. The Elbow is a point from where there is no significant
reduction in the Sum of the Squared Errors, even if the number of clusters increases
further.

Using the Yellow Brick Library, you can use the following Python code to calculate
the Elbow method:

from yellowbrick.cluster import KElbowVisualizer

The above Python statement imports the “KElbowVisualizer” class from the
“yellowbrick.cluster” module.
Yellowbrick is a machine learning visualization library in Python that extends
the scikit-learn API with visual analysis and diagnostic tools.

The “KElbowVisualizer” implements the “elbow” method of selecting the optimal


number of clusters for K-means clustering by fitting the model with a range of values
for “K“. It then visualizes the average within-cluster distances to help you select the
value of “K” where the improvement to the average distance falls off, i.e., the
“elbow” of the curve.

Silhouette Method

The Silhouette method is different from the Elbow method. It calculates the
Silhouette coefficient of every point. A Silhouette Coefficient, or the Silhouette
Score, is a metric to calculate the efficiency of a clustering technique, ranging from
the values -1 to 1.

The computation of the Silhouette Coefficient takes place for each data point by first
determining the average distance from the point to the other points in its cluster,
denoted as ‘a’. It then measures the average distance from the point to the points in
the nearest cluster that it is not a part of, denoted as ‘b’. The Silhouette Coefficient
is calculated using these average distances.

The formula for the Silhouette method:

S = b-a / max(a,b)

Simple K-means Clustering algorithm example

In this example, we will implement a K-means Clustering algorithm using Python and
Scikit-learn library to visualize a simple K-means clustering.

1. Import Libraries

# for reading and writing


import pandas as pd

# for efficient computations

import numpy as np

# data visualisation

import matplotlib.pyplot as plt

# Scikit-learn library

from sklearn.cluster import KMeans

%matplotlib inline

2. Generate Random Data

I have selected 150 data points to divide into two groups.

# Using np.random.rand to generate the random data

X= -2 * np.random.rand(150,2)

X1 = 1 + 2 * np.random.rand(50,2)

X[50:100, :] = X1

# Plot the randomly generate data on a simple graph

plt.scatter(X[ : , 0], X[ :, 1], s = 50)


plt.show()

3. Scikit-Learn

I will use the tools available in the Scikit-Learn library to process the randomly
generated data into clusters. This is where I am selecting my number of clusters. For
this example, I will be using two clusters:

# we also imported this library above

from sklearn.cluster import KMeans

# number of clusters, k = 2

Kmean = KMeans(n_clusters=2)

# Scikit-Learn KMeans tool which fits the data points to a meanKmean.fit(X)

4. Finding the Centroids

After selecting the number of clusters, ‘k’. Our next step was to find the Centroid. In
order to do this, we run this code:

# this will find the center of the clusters (the Centroid)

Kmean.cluster_centers_

The output:

array([[ 2.05811397, 2.10446134],

[-0.89414529, -1.0509698 ]])


5. Applying the Centroids

You can see the assignment of two centroids, one is Green, and the other is Orange.

plt.scatter(X[ : , 0], X[ : , 1], s =50)

plt.scatter(2.05811397, 2.10446134, s=200)

plt.scatter(-0.89414529, -1.0509698, s=200)

plt.show()

If you would like to test the algorithm or see which data points have been assigned
to which characteristic, you can do so by the following code:

Kmean.labels_

Output:

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,

1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

6. Testing

If you wish to test your K-means cluster, generate a sample, reshape it, and use
“Kmeans.predict”. This will assign the new data point to a centroid. Please chec k
the following code:
sample_test=np.array([-2.0,-2.0])

second_test=sample_test.reshape(1, -1)

Kmean.predict(second_test)

Output:

array([1], dtype=int32)

Advantages and Disadvantages of K -means

Advantages:

 Simplicity and Speed: K-means is easy to understand and implement. It is


computationally efficient, making it suitable for large datasets.
 Scalability: The algorithm can easily handle very large datasets, especially
when using variations like Mini-Batch K-Means.
 Adaptability: It can be adapted with various distance metrics (though the
standard is Euclidean distance).
 Well-studied: As one of the most used clustering algorithms, there are many
resources and community knowledge for troubleshooting and improving its
performance.
 Feature Space Reduction: It can reduce the feature space to help with data
visualization (e.g., identifying the centroids can simplify the complexity of the
data).

Disadvantages:

 Choosing k: Determining the number of clusters (k) is not straightforward and


often requires domain knowledge or addit ional methods like the elbow method
or silhouette analysis.
 Sensitivity to Initial Centroids: The final results can be sensit ive to the initial
random assignment of centroids, leading to different clustering results from
run to run.
 Sensitivity to Outliers: K-means clustering is sensitive to outliers, as a single
outlier can significantly shift the centroid of a cluster.
 Assumption of Spherical Clusters: It assumes that clusters are spherical and
evenly sized, which might not always be true. Consequently, it does not work
well with complex geometric shapes (e.g., elongated clusters or manifolds).
 Difficulty with Different Densities: K-means clustering struggles when
clusters have varying densities and different numbers of points.
 Hard Clustering: It assigns each data point to a single cluster (hard
clustering), rather than providing probabilities of belonging to various clusters
(soft clustering).
 Local Optima: K-means may converge to a local optimum, which is not
necessarily the best solution. Repeated runs with different initializations are
often necessary to get a satisfactory result.
 In practice, these disadvantages mean that while K-means is a good starting
point for clustering analysis, it may not be suitable for all datasets o r problems,
and we may require other clustering algorithms for more nuanced analysis.

Applications of K-Means Clustering

K-means clustering is a versatile algorithm with a wide array of applications across


various fields. Here are some detailed examples of its applications:

Market Segmentation:

Companies use K-means to partition customers into groups based on similar attributes
like buying habits, demographics, and psychographics. This allows for targeted
marketing strategies, personalized customer engageme nt, and efficient allocation of
marketing resources.

Document Clustering:

In text mining, K-means can be used to group similar documents, which helps
organize large volumes of text data. This grouping improves information retrieval
systems. Moreover, it he lps in the identification of thematic patterns across
documents.
Image Segmentation:

K-means is used in computer vision to segment digital images into distinct regions.
This is useful for object recognition, image compression, and enhancing the
effectiveness of other image processing tasks.

Anomaly Detection:

By clustering data into groups, K-means can help identify data points that do not
belong to any cluster or are significantly far from their centroids, which could be
potential anomalies or outliers.

Feature Learning:

It can be used for feature engineering by identifying prominent clusters within feature
sets, which can then be used to create new features that help improve the performance
of predictive models.

Pattern Recognition:

K-means is employed to find patterns and associations in datasets, such as identifying


the behavior of complex systems or common sequences in DNA strings.

Astronomy:

In astronomy, K-means helps in clustering and classifying galaxies based on their


shapes and other properties, aiding in the study of cosmic evolution and the universe’s
structure.

Insurance Fraud Detection:

Insurers utilize clustering to find patterns and groupings in claims data that may
indicate fraudulent activit y, enabling them to focus investigative resources more
effectively.

Healthcare:
In healthcare, K-means can cluster patient data to identify commonalities in medical
conditions, leading to more personalized treatment plans and a better understanding
of patient needs.

Operational Efficiency:

Manufacturing and service industries use K-means to improve operational efficiency


by clustering machines with similar profiles or failures, streamlining maintenance
and support processes.

Urban Planning:

K-means can be applied to cluster houses, roads, and ar eas to optimize the planning
of utilities and services, like hospitals, schools, and emergency services.

Social Network Analysis:

Clustering users based on their connections and interactions can reveal communit ies
within social networks, influential users, and the flow of information.

Conclusion

The above applications leverage K-means’ ability to make sense of unlabeled data by
finding natural groupings within.

However, the success of K-means in these applications depends on the nature of the
data and the careful selection of the number of clusters k. Other clustering techniques
may be more appropriate in cases where the assumptions of K-means do not hold,
such as non-spherical clusters or clusters of different sizes and densities.
4.10 INSTANCE BASED LEARNING
What is Instance-based Learning?

Instance-based Learning, also known as Memory-based Learning, is a family of learning


algorithms that, instead of performing explicit generalization, compares new problem instances
with instances seen in training. It is an integral part of Machine Learning, used to classify new
instances based on a similarity measure.

History

Instance-based Learning algorithms were developed in the 1980s as a part of machine learning
research. The most widely recognized algorithm within this family is the K-Nearest Neighbor
(K-NN) Algorithm. These algorithms have seen extensive use in various data mining
applications and pattern recognition.

Functionality and Features

Instance-based Learning models employ an approach where the instances are used during the
learning process. The core features of these models include the following:

 Classification: Instance-based Learning excels in classifying unseen data based on the


proximity to the known instances.
 Regression: These models can also perform regression tasks, predicting continuous
outputs.
 Lazy Learning: Being a type of lazy learning, Instance-based Learning does not create
general models but uses training data during the testing phase too.

Security Aspects

While Instance-based Learning itself does not have inherent security measures, it is critical to
ensure data privacy and security when using this method, especially in a data lakehouse setup.
It's vital to encrypt sensitive data and use access controls to protect the integrity and privacy of
the data.
Performance

The performance of Instance-based Learning is greatly influenced by the choice of similarity


function and the size of the dataset. In a data lakehouse environment, performance can be
significantly enhanced through distributed computing and scalable storage infrastructure.

Instance-based and model-based learning are two different approaches in machine learning for
solving tasks and making predictions. Let's delve into each of these approaches:

1. Instance-Based Learning:

Instance-based learning, also known as lazy learning, is a type of machine learning where the
model doesn't explicitly learn a general representation of the underlying data distribution.
Instead, it stores the training instances themselves and uses them to make predictions on new,
unseen instances. The key idea is to compare the new instance to the stored instances and find
the most similar ones to make a prediction.
The primary algorithm in instance-based learning is k-nearest neighbors (k-NN). In k-NN,
when given a new data point, the algorithm identifies the k-closest training instances
(neighbors) and uses their labels to determine the label of the new instance. The choice of k
determines the level of smoothness or noise resistance of the predictions.
Some of the instance-based learning algorithms are :

1. K Nearest Neighbor (KNN)


2. Self-Organizing Map (SOM)
3. Learning Vector Quantization (LVQ)
4. Locally Weighted Learning (LWL)
5. Case-Based Reasoning

The Machine Learning systems which are categorized as instance-based learning are the
systems that learn the training examples by heart and then generalizes to new instances based
on some similarity measure. It is called instance-based because it builds the hypotheses from
the training instances. It is also known as memory-based learning or lazy-
learning (because they delay processing until a new instance must be classified). The time
complexity of this algorithm depends upon the size of training data. Each time whenever a
new query is encountered, its previously stores data is examined. And assign to a target
function value for the new instance.
The worst-case time complexity of this algorithm is O (n), where n is the number of training
instances. For example, If we were to create a spam filter with an instance-based learning
algorithm, instead of just flagging emails that are already marked as spam emails, our spam
filter would be programmed to also flag emails that are very similar to them. This requires a
measure of resemblance between two emails. A similarity measure between two emails could
be the same sender or the repetitive use of the same keywords or something else.

Benefits and Use Cases

Instance-based Learning provides various benefits:

 Flexibility: It can adapt quickly to changes as it does not rely on a previously built model.
 Ease of implementation: The algorithm is easy to implement and understand.
 No Training Phase: It does not require a training phase as each instance represents itself.

Instance-based Learning has found applications in many areas such as recommendation


systems, image recognition, and computer vision.

Challenges and Limitations

Despite its benefits, Instance-based Learning also has its limitations:

 Performance: The performance heavily depends on the quality of the dataset.


 Time and Space Intensive: The algorithm can be computationally expensive and require
significant storage space, especially for larger datasets.
 Sensitivity to Irrelevant Features: It is affected by irrelevant features, which may cause
misclassification.

Integration with Data Lakehouse

In a data lakehouse environment, Instance-based Learning can be used to analyze and classify
the vast, diverse data. It can help in creating segmentation and classification rules for better
data organization and retrieval. Further, with the scale-out architecture of a data lakehouse, vast
amounts of instance data can be stored and processed efficiently, overcoming one of the major
limitations of Instance-based Learning.

Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the
target function.
2. This algorithm can adapt to new data easily, one which is collected as we go.
3. No explicit model training is required, so it's easy to adapt to new data.
4. Can handle complex relationships and adapt well to varying data distributions.
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query involves starting the
identification of a local model from scratch.
3. Can be computationally expensive, especially with large datasets.
4. Sensitive to irrelevant or noisy features in the data.
5. Lack of a learned model makes it harder to interpret or explain predictions.
4.12 GAUSSIAN MIXTURE MPDELS AND EXPECTATION MAXIMIZATIONS

GMM algorithm

The Gaussian mixture model (GMM) is a probabilistic model that assumes the data points
come from a limited set of Gaussian distributions with uncertain variables. The mean and
covariance matrix characterizes each individual Gaussian distribution.

As an extension of the k-means clustering technique, a GMM takes into account the data’s
covariance structure and the likelihood of each point being derived from each Gaussian
distribution.

The Gaussian Mixture Model or GMM is defined as a mixture model that has a combination
of the unspecified probability distribution function. Further, GMM also requires estimated
statistics values such as mean and standard deviation or parameters. It is used to estimate the
parameters of the probability distributions to best fit the density of a given training dataset.
Although there are plenty of techniques available to estimate the parameter of the Gaussian
Mixture Model (GMM), the Maximum Likelihood Estimation is one of the most popular
techniques among them.

Each data point is given a chance of belonging to each cluster, making it a soft Gaussian mixture
model clustering technique. This provides more leeway and may accommodate scenarios when
data points do not naturally fall into one cluster.

Let's understand a case where we have a dataset with multiple data points generated by two
different processes. However, both processes contain a similar Gaussian probability
distribution and combined data. Hence it is very difficult to discriminate which distribution a
given point may belong to.

The processes used to generate the data point represent a latent variable or unobservable data.
In such cases, the Estimation-Maximization algorithm is one of the best techniques which helps
us to estimate the parameters of the gaussian distributions. In the EM algorithm, E-step
estimates the expected value for each latent variable, whereas M-step helps in optimizing them
significantly using the Maximum Likelihood Estimation (MLE). Further, this process is
repeated until a good set of latent values, and a maximum likelihood is achieved that fits the
data.

The GMM is trained using the EM algorithm, an iterative approach for determining the most
likely estimations of the mixture Gaussian distribution parameters. The EM method first makes
rough guesses at the parameters, then repeatedly improves those guesses until convergence is
reached.

The GaussianMixture class from the Scikit-learn toolkit makes it possible to implement the
Gaussian mixture model in Python. It offers many choices for configuring the algorithm’s
initialization, covariance type, and other settings, and it’s quite easy to use.

This is how the GMM algorithm works:

 Initialize phase: Gaussian distributions’ parameters should be initialized (means, covariances,


and mixing coefficients).

 Expectation phase: Determine the likelihood that each data point was created using each of the
Gaussian distributions.

 Maximization phase: Apply the probabilities found in the expectation step to re-estimate the
Gaussian distribution parameters.

 Final phase: To achieve convergence of the parameters, repeat steps 2 and 3.

GMM equation

The Gaussian mixture model equation defines the probability density function of a multivariate
Gaussian mixture. Pdf is a mathematical function that characterizes the likelihood that a given
data point, x, belongs to a certain cluster or component, k.

The pdf for a GMM with K clusters is extracted by this Gaussian equation:

 pdf(x) = Σ(k=1 to K) π_k * N(x|μ_k, Σ_k)

Where:
 π_k is the mixing coefficient.

 μ_k is the mean vector.

 Σ_k is the covariance matrix.

 N(x|μ_k, Σ_k) is the probability density function.

The mixing coefficients π_k are non-negative and sum to 1. They represent the proportion of
the data points that belong to cluster k.

The Gaussian distributions, represented by N(x|μ_k, Σ_k), are the likelihoods of the data points
x given the cluster parameters. Each cluster k has its own mean vector μ_k and covariance
matrix Σ_k.

Finding the most likely estimates of the parameters of the Gaussian distributions requires
solving the GMM equation, which is used in the Expectation-Maximization (EM) process. The
EM method first makes rough guesses at the parameters, then repeatedly improves those guesses
until convergence is reached.

Application of GMM

Density estimation and clustering both benefit greatly from the use of GMM. Useful for both
data creation and imputation, GMM’s generative model capability makes it a powerful tool in a
variety of contexts.

Additionally, GMM may be used to model data with several modes, where each mode is
represented by a Gaussian distribution.

Recently, GMM has been put to use in the feature extraction of voice data for use in speech
recognition systems. They have also seen widespread application in multi-object tracking,
where the number of components and respective means are used to forecast object placements
at each frame of a video sequence. To enable object tracking over time, the EM method is
employed to update the component means between video frames.
EM Algorithm in Machine Learning

The EM algorithm is considered a latent variable model to find the local maximum
likelihood parameters of a statistical model, proposed by Arthur Dempster, Nan Laird, and
Donald Rubin in 1977. The EM (Expectation-Maximization) algorithm is one of the most
commonly used terms in machine learning to obtain maximum likelihood estimates of variables
that are sometimes observable and sometimes not. However, it is also applicable to unobserved
data or sometimes called latent. It has various real-world applications in statistics, including
obtaining the mode of the posterior marginal distribution of parameters in machine
learning and data mining applications.

In most real-life applications of machine learning, it is found that several relevant learning
features are available, but very few of them are observable, and the rest are unobservable. If
the variables are observable, then it can predict the value using instances. On the other hand,
the variables which are latent or directly not observable, for such variables Expectation-
Maximization (EM) algorithm plays a vital role to predict the value with the condition that the
general form of probability distribution governing those latent variables is known to us. In this
topic, we will discuss a basic introduction to the EM algorithm, a flow chart of the EM
algorithm, its applications, advantages, and disadvantages of EM algorithm, etc.

What is an EM algorithm?

The Expectation-Maximization (EM) algorithm is defined as the combination of various


unsupervised machine learning algorithms, which is used to determine the local maximum
likelihood estimates (MLE) or maximum a posteriori estimates (MAP) for unobservable
variables in statistical models. Further, it is a technique to find maximum likelihood estimation
when the latent variables are present. It is also referred to as the latent variable model.

A latent variable model consists of both observable and unobservable variables where
observable can be predicted while unobserved are inferred from the observed variable. These
unobservable variables are known as latent variables.

Key Points:

o It is known as the latent variable model to determine MLE and MAP parameters for
latent variables.
o It is used to predict values of parameters in instances where data is missing or
unobservable for learning, and this is done until convergence of the values occurs.

EM Algorithm

The EM algorithm is the combination of various unsupervised ML algorithms, such as the k-


means clustering algorithm. Being an iterative approach, it consists of two modes. In the first
mode, we estimate the missing or latent variables. Hence it is referred to as
the Expectation/estimation step (E-step). Further, the other mode is used to optimize the
parameters of the models so that it can explain the data more clearly. The second mode is
known as the maximization-step or M-step.

o Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.

o Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.

o Repeat E-step and M-step until the convergence of the values occurs.

The primary goal of the EM algorithm is to use the available observed data of the dataset to
estimate the missing data of the latent variables and then use that data to update the values of
the parameters in the M-step.
What is Convergence in the EM algorithm?

Convergence is defined as the specific situation in probability based on intuition, e.g., if there
are two random variables that have very less difference in their probability, then they are known
as converged. In other words, whenever the values of given variables are matched with each
other, it is called convergence.

Steps in EM Algorithm

The EM algorithm is completed mainly in 4 steps, which include Initialization Step,


Expectation Step, Maximization Step, and convergence Step. These steps are explained as
follows:

o 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from
a specific model.

o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. Further, E-
step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o 4th step: The last step is to check if the values of latent variables are converging or not.
If it gets "yes", then stop the process; else, repeat the process from step 2 until the
convergence occurs.

Applications of EM algorithm

The primary aim of the EM algorithm is to estimate the missing data in the latent variables
through observed data in datasets. The EM algorithm or latent variable model has a broad range
of real-life applications in machine learning. These are as follows:

o The EM algorithm is applicable in data clustering in machine learning.


o It is often used in computer vision and NLP (Natural language processing).
o It is used to estimate the value of the parameter in mixed models such as the Gaussian
Mixture Modeland quantitative genetics.

o It is also used in psychometrics for estimating item parameters and latent abilities of
item response theory models.

o It is also applicable in the medical and healthcare industry, such as in image


reconstruction and structural engineering.

o It is used to determine the Gaussian density of a function.

Advantages of EM algorithm

o It is very easy to implement the first two basic steps of the EM algorithm in various
machine learning problems, which are E-step and M- step.
o It is mostly guaranteed that likelihood will enhance after each iteration.
o It often generates a solution for the M-step in the closed form.

Disadvantages of EM algorithm

o The convergence of the EM algorithm is very slow.


o It can make convergence for the local optima only.
o It takes both forward and backward probability into consideration. It is opposite to that
of numerical optimization, which takes only forward probabilities.

You might also like