AI ML UNIT 4 NOTES
Generally, we develop machine learning models to solve a problem by using the given input data.
When we work with a single algorithm, we cannot judge how well the model performs
for that particular problem statement, as there is nothing to compare it against. So, we feed the input data
to other machine learning algorithms and then compare them with each other to know which is
the best algorithm that suits the given problem. Every algorithm has its own mathematical
computation and significance to deal with a specific problem to bring out the best results at the
end.
While dealing with a specific problem using a machine learning algorithm, we sometimes fail
because of the poor performance of the model. The algorithm that we have used may be well
suited to the problem, but we still fail to get better outcomes at the end. In this situation, we
might have many questions in our mind. How can we bring out better results from the model?
What are the steps to be taken further in the model development? What are the hidden techniques
that can help to develop an efficient model?
To overcome this situation there is a procedure called “Combining Models”, where we combine two
or more weaker machine learning models to solve a problem and get better outcomes. In machine
learning, the combining of models is done by using two approaches namely “Ensemble Models”
& “Hybrid Models”.
Ensemble Models use multiple machine learning algorithms to bring out better predictive results,
as compared to using a single algorithm. There are different approaches in Ensemble models to
perform a particular task. There is another model called Hybrid model that is flexible and helps
to create a more innovative model than an Ensemble model. While combining models we need to
check how strong or weak a particular machine learning model is, to deal with a particular
problem.
Ensemble learning combines multiple machine learning models into a single model. The aim is to
increase the performance of the model. Bagging aims to decrease variance, boosting aims to decrease
bias, and stacking aims to improve prediction accuracy. Bagging and boosting combine homogenous
weak learners.
Ensemble learning is one of the most powerful machine learning techniques that use the combined
output of two or more models/weak learners and solve a particular computational intelligence
problem. E.g., a Random Forest algorithm is an ensemble of various decision trees combined.
Ensemble learning is primarily used to improve the model performance, such as classification,
prediction, function approximation, etc. When designing a learning machine, we generally make
some choices like parameters of machine, training data, representation, etc. This implies some sort
of variance in performance. For example, in a classification setting, we can use a parametric
classifier or a multilayer perceptron, and for the latter we should also decide on the number of hidden units.
Each learning algorithm dictates a certain model that comes with a set of assumptions. This
inductive bias leads to error if the assumptions do not hold for the data.
Different learning algorithms have different accuracies. The no free lunch theorem asserts that no
single learning algorithm always achieves the best performance in every domain. Different learners can
therefore be combined to attain higher accuracy.
• Data fusion is the process of fusing multiple records representing the same real-world object
into a single, consistent, and clean representation. Fusion of data for improving prediction
accuracy and reliability is an important problem in machine learning.
• Combining different models is done to improve the performance of deep learning models.
Building a new model by combination requires less time, data, and computational resources. The
most common method to combine models is by averaging multiple models, where taking a
weighted average improves the accuracy.
Different Hyper-parameters: We can use the same learning algorithm but use it with different
hyper-parameters.
Different Training Sets: Another possibility is to train different base-learners on different subsets of
the training set.
1. Multiexpert combination
a) Global approach (learner fusion): all base-learners generate an output for every input, and these outputs are then combined, for example by voting or stacking.
b) Local approach (learner selection): in a mixture of experts, there is a gating model, which looks
at the input and chooses one (or very few) of the learners as responsible for generating the output.
2. Multistage combination
Multistage combination methods use a serial approach, where the next base-learner is trained on
or tested with only the instances on which the previous base-learners are not accurate enough.
Let's assume that we want to construct a function that maps inputs to outputs from a set of
N_train known input-output pairs.
Classification: the output takes values in a discrete set of class labels Y = {c1, c2, …, cK},
where K is the number of different classes. Regression consists in predicting continuous ordered
outputs, Y = R.
1.2 Voting
The simplest way to combine multiple classifiers is by voting, which corresponds to taking a
linear combination of the learners. Voting is an ensemble machine learning algorithm.
• For regression, a voting ensemble involves making a prediction that is the average of multiple
other regression models.
A Voting Classifier is a machine learning model that trains on an ensemble of numerous models
and predicts an output (class) based on the class with the highest probability of being chosen as the output.
• It simply aggregates the findings of each classifier passed into the Voting Classifier and
predicts the output class based on the majority of votes.
• The idea is that, instead of creating separate dedicated models and finding the accuracy of each
of them, we create a single model which trains on these models and predicts the output based on
their combined majority of votes for each output class.
A Voting Classifier supports two types of voting.
Hard Voting: In hard voting, the predicted output class is the class with the highest majority of
votes, i.e., the class that was predicted most often by the individual classifiers.
Suppose three classifiers predict the output classes (A, A, B); here the majority predicted A as
the output. Hence A will be the final prediction.
Soft Voting: In soft voting, the output class is the prediction based on the average of probability
given to that class.
• Suppose that, for some input, the prediction probabilities of the three models for class A are (0.30,
0.47, 0.53) and for class B are (0.20, 0.32, 0.40). The average for class A is 0.4333 and for B is 0.3067,
so the winner is clearly class A because it has the highest probability averaged over the
classifiers.
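As an illustration, scikit-learn's VotingClassifier supports both hard and soft voting; the base classifiers and the synthetic dataset in the minimal sketch below are purely illustrative choices.
# Minimal sketch of hard and soft voting with scikit-learn
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
estimators = [('lr', LogisticRegression()), ('dt', DecisionTreeClassifier()), ('nb', GaussianNB())]

# Hard voting: the majority class label wins
hard_vote = VotingClassifier(estimators=estimators, voting='hard').fit(X, y)
# Soft voting: the class with the highest average predicted probability wins
soft_vote = VotingClassifier(estimators=estimators, voting='soft').fit(X, y)
print(hard_vote.predict(X[:3]), soft_vote.predict(X[:3]))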
In these methods, the first step is to create multiple classification/regression models using some
training dataset. Each base model can be created using different splits of the same training dataset
with the same algorithm, using the same dataset with different algorithms, or by any other method.
The idea is to learn multiple alternative definitions of a concept using different training data or different
learning algorithms, and then to combine the decisions of those multiple definitions.
Ensemble Learning Techniques in Machine Learning. Machine learning models suffer from bias
and/or variance. Bias is the difference between the value predicted by the model and the actual value.
Bias is introduced when the model doesn't consider the variation in the data and creates a
simple model. The simple model doesn't follow the patterns of the data, and hence the model gives
errors on the training as well as the testing data, i.e., it is a model with high bias.
When the model treats even random quirks of the data as patterns, the
model might do very well on the training dataset, i.e., it gives low bias, but it fails on test data and
gives high variance.
Therefore, to improve the accuracy (estimate) of the model, ensemble learning methods were
developed. Ensemble learning is a machine learning concept in which several models are trained using
machine learning algorithms. It combines low-performing classifiers (also called weak
learners or base learners) and combines their individual predictions into the final prediction.
On the basis of the type of base learners, ensemble methods can be categorized as homogeneous
and heterogeneous ensemble methods. If the base learners are the same, it is a homogeneous
ensemble method. If the base learners are different, it is a heterogeneous ensemble method.
The main ensemble learning techniques are:
1. Bagging
2. Boosting
3. Stacking
A high-bias model results from not learning the data well enough. Its predictions do not reflect
the distribution of the data, and hence future predictions will be unrelated to the data
and thus incorrect.
A high variance model results from learning the data too well. It varies with each
data point. Hence it is impossible to predict the next point accurately.
Both high bias and high variance models thus cannot generalize properly. Thus, weak
learners will either make incorrect generalizations or fail to generalize altogether. Because
of this, the predictions of weak learners cannot be relied on by themselves.
As we know from the bias-variance trade-off, an underfit model has high bias and low
variance, whereas an overfit model has high variance and low bias. In either case, there is
no balance between bias and variance. For there to be a balance, both the bias and variance
need to be low. Ensemble learning tries to balance this bias-variance trade-off by reducing
either the bias or the variance.
It aims to reduce the bias if we have a weak model with high bias and low
variance. Ensemble learning will aim to reduce the variance if we have a weak model with
high variance and low bias. This way, the resulting model will be much more balanced,
with low bias and variance. Thus, the resulting model will be known as a strong
learner. This model will be more generalized than the weak learners. It will thus be able
to make accurate predictions.
Bagging is used to reduce the variance of weak learners. Boosting is used to reduce the
bias of weak learners. Stacking is used to improve the overall accuracy of strong learners.
We use bagging for combining weak learners of high variance. Bagging aims to produce
a model with lower variance than the individual weak models. These weak learners are
homogenous, meaning they are of the same type.
Bagging is also known as Bootstrap aggregating. It consists of two steps: bootstrapping
and aggregation.
Bootstrapping
Involves resampling subsets of data with replacement from an initial dataset. In other
words, subsets of data are taken from the initial dataset. These subsets of data are called
bootstrapped datasets or, simply, bootstraps. Resampled ‘with replacement’ means an
individual data point can be sampled multiple times. Each bootstrap dataset is used to train
a weak learner.
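A minimal NumPy sketch of drawing one bootstrap sample, i.e., sampling row indices with replacement (the toy dataset is illustrative):
import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(10, 1)                      # toy dataset of 10 points
idx = rng.choice(len(X), size=len(X), replace=True)   # indices sampled with replacement
bootstrap = X[idx]                                    # some rows repeat, some are left out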
Aggregating
The individual weak learners are trained independently from each other. Each learner
makes independent predictions. The results of those predictions are aggregated at the end
to get the overall prediction. The predictions are aggregated using either max voting or
averaging.
Max Voting
It is a method commonly used for classification problems that consists of taking the mode of the
predictions (the most frequently occurring prediction). It is called voting because, like in election
voting, the premise is that ‘the majority rules’. Each model makes a prediction. A
prediction from each model counts as a single ‘vote’. The most occurring ‘vote’ is chosen
as the representative for the combined model.
Averaging
It is generally used for regression problems. It involves taking the average of the
predictions. The resulting average is used as the overall prediction for the combined model.
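For instance, aggregating three hypothetical base-model outputs by max voting (classification) and by averaging (regression) can be done as follows; the prediction values are made up for illustration:
import numpy as np

class_preds = np.array([1, 1, 0])             # votes from three classifiers (integer labels)
majority = np.bincount(class_preds).argmax()  # max voting -> 1

reg_preds = np.array([2.4, 2.9, 3.1])         # outputs of three regressors
average = reg_preds.mean()                    # averaging -> 2.8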
Steps of Bagging
1. We start with an initial training dataset containing N data points.
2. We create m subsets of data from the training set. We take a subset
of N sample points from the initial dataset for each subset. Each subset is taken
with replacement. This means that a specific data point can be sampled more than
once.
3. For each subset of data, we train the corresponding weak learners independently.
These models are homogeneous, meaning that they are of the same type.
4. Each weak learner makes its own independent prediction.
5. The predictions are aggregated into a single prediction. For this, either max voting
or averaging is used.
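A minimal scikit-learn sketch of these bagging steps (the synthetic dataset is an illustrative choice; the default base estimator is a decision tree):
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=300, random_state=0)
# 10 homogeneous learners, each trained on a bootstrap sample; predictions are aggregated by voting
bagger = BaggingClassifier(n_estimators=10, bootstrap=True, random_state=0).fit(X, y)
print(bagger.predict(X[:5]))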
Reducing Bias by Boosting
We use boosting for combining weak learners with high bias. Boosting aims to produce a
model with a lower bias than that of the individual models. Like in bagging, the weak
learners are homogeneous.
Boosting involves sequentially training weak learners. Here, each subsequent learner
improves the errors of previous learners in the sequence. A sample of data is first taken
from the initial dataset. This sample is used to train the first model, and the model makes
its prediction. The samples can either be correctly or incorrectly predicted. The samples
that are wrongly predicted are reused for training the next model. In this way, subsequent
models can improve on the errors of previous models.
Unlike bagging, which aggregates prediction results at the end, boosting aggregates the
results at each step. They are aggregated using weighted averaging.
Weighted averaging involves giving all models different weights depending on their
predictive power. In other words, it gives more weight to the model with the highest
predictive power. This is because the learner with the highest predictive power is
considered the most important.
Steps of Boosting
Boosting works with the following steps:
1. We sample a subset of data from the initial training dataset.
2. Using this subset, we train the first weak learner.
3. We test the trained weak learner using the training data. As a result of the testing,
some data points will be incorrectly predicted.
4. Each data point with the wrong prediction is sent into the second subset of data,
and this subset is updated.
5. Using this updated subset, we train and test the second weak learner.
6. We continue with the following subsets until the total number of subsets is reached.
7. We now have the total prediction. The overall prediction has already been
aggregated at each step, so there is no need to calculate it.
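One widely used boosting algorithm that follows this scheme is AdaBoost; a minimal scikit-learn sketch (the synthetic dataset is an illustrative choice):
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier

X, y = make_classification(n_samples=300, random_state=0)
# Weak learners (shallow trees by default) are trained sequentially;
# wrongly predicted points receive more weight in the next round.
booster = AdaBoostClassifier(n_estimators=50, random_state=0).fit(X, y)
print(booster.predict(X[:5]))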
We use stacking to improve the prediction accuracy of strong learners. Stacking aims to
create a single robust model from multiple heterogeneous strong learners.
Individual heterogeneous models are trained using an initial dataset. These models make
predictions, and those predictions form a single new dataset. This new dataset is
used to train the meta-model, which makes the final prediction. The predictions are combined
using weighted averaging.
Because stacking combines strong learners, it can combine bagged or boosted models.
Steps of Stacking
1. We use the initial training dataset to train the individual heterogeneous models.
2. The predictions made by these models are used to form a single new dataset.
3. This new dataset is used to train the meta-model.
4. Using the results of the meta-model, we make the final prediction. The results are
combined using weighted averaging.
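A minimal scikit-learn sketch of stacking (the base learners, meta-model, and synthetic dataset are illustrative choices):
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = make_classification(n_samples=300, random_state=0)
# Heterogeneous base learners; their predictions form the input of a logistic-regression meta-model
stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(random_state=0)), ('svc', SVC())],
    final_estimator=LogisticRegression()).fit(X, y)
print(stack.predict(X[:5]))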
If you want to reduce the overfitting or variance of your model, you use bagging and if
you are looking to reduce underfitting or bias, you use boosting. However, if you want to
increase predictive accuracy, use stacking.
Bagging and boosting both work with homogeneous weak learners. Stacking works with
heterogeneous strong learners.
All three of these methods can work with either classification or regression problems.
Conversely, it is not advisable to use bagging to reduce bias or
underfitting, because bagging does little to reduce the bias of its base learners.
Stacked models have the advantage of better prediction accuracy than bagging or boosting.
But because they combine bagged or boosted models, they have the disadvantage of
needing much more time and computational power. If you are looking for faster results,
it’s advisable not to use stacking. However, stacking is the way to go if you’re looking for
high accuracy.
Conclusion
Bagging, boosting, and stacking are vital techniques in ensemble learning in machine
learning. They play a crucial role in enhancing model accuracy and mitigating the risks
associated with inaccurate predictions.
Ensemble learning combines multiple machine learning models into a single model. The
aim is to increase the performance of the model.
Bagging aims to decrease variance, boosting aims to decrease bias, and stacking
aims to improve prediction accuracy.
Bagging trains models in parallel and boosting trains the models sequentially.
Stacking creates a meta-model.
4.8 UNSUPERVISED LEARNING
There may be many cases in which we do not have labeled data and need to find the hidden
patterns from the given dataset. So, to solve such types of cases in machine learning, we need
unsupervised learning techniques.
As the name suggests, unsupervised learning is a machine learning technique in which models
are not supervised using a training dataset. Instead, the models themselves find the hidden patterns and
insights from the given data. It can be compared to the learning which takes place in the human
brain while learning new things. It can be defined as:
Unsupervised learning is a type of machine learning in which models are trained using an
unlabeled dataset and are allowed to act on that data without any supervision.
Example: Suppose the unsupervised learning algorithm is given an input dataset containing
images of different types of cats and dogs. The algorithm is never trained upon the given
dataset, which means it does not have any idea about the features of the dataset. The task of the
unsupervised learning algorithm is to identify the image features on its own. The unsupervised
learning algorithm will perform this task by clustering the image dataset into groups
according to the similarities between the images.
Why use Unsupervised Learning?
Below are some main reasons which describe the importance of Unsupervised Learning:
o Unsupervised learning is helpful for finding useful insights from the data.
o Unsupervised learning is much like how a human learns to think from their own
experiences, which makes it closer to real AI.
o Unsupervised learning works on unlabeled and uncategorized data, which makes
unsupervised learning more important.
o In the real world, we do not always have input data with corresponding outputs, so to
solve such cases, we need unsupervised learning.
Once the suitable algorithm is applied, it divides the data objects into groups
according to the similarities and differences between the objects.
The unsupervised learning algorithm can be further categorized into two types of
problems:
o Clustering: Clustering is a method of grouping objects into clusters such that
objects with the most similarities remain in one group and have few or no similarities with
the objects of another group. Cluster analysis finds the commonalities between the data
objects and categorizes them as per the presence and absence of those commonalities.
o Association: An association rule is an unsupervised learning method which is used for
finding relationships between variables in a large database. It determines the sets
of items that occur together in the dataset. Association rules make marketing strategies
more effective; for example, people who buy item X (say, bread) also tend to
purchase item Y (butter/jam). A typical example of association rule mining is Market Basket
Analysis.
Unsupervised Learning algorithms:
o K-means clustering
o KNN (k-nearest neighbors)
o Hierarchal clustering
o Anomaly detection
o Neural Networks
o Principal Component Analysis
o Independent Component Analysis
o Apriori algorithm
o Singular value decomposition
o The result of the unsupervised learning algorithm might be less accurate as input data
is not labeled, and algorithms do not know the exact output in advance.
4.9 K-MEANS
K-Means Algorithm
K-means aims to find groups in the data, with the number of groups represented by
the variable K. Based on the provided features, the algorithm works iteratively to
assign each data point to one of the “K” groups. Please note that the clustering of the
data points is by the similarity of features.
K-Means uses a centroid-based model. Centroid models use the notion that data
points close to the center of a cluster are related.
A target number referred to as ‘k’ is the number of centroids you need in the dataset.
A centroid is the cluster’s center, and each data point is assigned to a specific cluster.
The ‘means’ in the name refers to the averaging of the data, which helps to find the
centroid location.
The K-means algorithm aims to reduce the distance between the data points and their
cluster centroid.
What is a Centroid?
A centroid is the center point of a cluster, and each data point is assigned to the cluster whose centroid it is closest to. The K-means algorithm proceeds through the following steps:
1. Choosing the number of clusters (k) : This is often done using methods like
the elbow or silhouette method. This helps determine the optimal number of
clusters by evaluating the within-cluster sum of squares (WCSS) or silhouette
score.
2. Randomly selecting initial centroids: While centroids can be randomly
selected, some implementations of K-means use more sophisticated methods
for initialization, such as the k-means++ algorithm, which aims to spread out
the initial centroids before proceeding with the standard K-means optimization
iterations.
3. Assigning data points to the nearest centroid: The algorithm assigns each
data point to the cluster with the closest centroid, typically using the Euclidean
distance metric.
4. Optimizing centroids: After the assignment, the centroids are recalculated as
the mean of all points in their cluster, which may or may not result in the
centroid moving.
5. Iterative process: The assignment and update steps (3 and 4) are repeated
until a stopping criterion is met. This could be a fixed number of iterations, a
threshold of minimal changes in centroid positions, minimal changes in the
assignment of the data points to the clusters, or when there is no further
decrease in WCSS.
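A minimal NumPy sketch of the assignment and update loop described in steps 3 to 5 (initialization here is a simple random choice of data points, and empty clusters are not handled):
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]   # random initial centroids
    for _ in range(n_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 5: stop when the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids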
Please note that the quality of the cluster assignments is determined by the within-cluster
sum of squares (WCSS), which the algorithm tries to minimize.
One of the challenges in K-means is choosing the right K, i.e., the number of clusters.
This is often done using methods like the elbow method. In the elbow method, the
WCSS is plotted against the number of clusters, and the “elbow point” of the curve
is chosen as the optimal K.
The K-means iterations stop when one of the following criteria is met:
No change in the centroids (i.e., the positions of the centroids do not change
between iterations)
No change in cluster assignments (i.e., data points remain in the same cluster)
A predetermined number of iterations is reached
The decrease in the within-cluster sum of squares (WCSS) between iterations
is below a certain threshold.
If the centroids of newly formed clusters are not changing, we can interpret
that the algorithm is not learning any new characteristics or patterns, acting as
a sign to stop the training. Likewise, if the data points remain in the same clusters
after multiple further iterations, the training can be stopped.
Choosing the Correct Number of Clusters, k.
Regarding the simple example later in this section, I chose 2 clusters purely for illustration.
However, you might be wondering how to choose the right number of clusters.
There are two methods that you can use to select the correct value of k:
1. Elbow Method
2. Silhouette Method
Elbow Method
The Elbow Method is a popular tool that runs k-means for a range of values of k
and plots each value of k against the resulting Sum of Squared Errors; the resulting plot is referred to as a scree plot.
The image below shows that the total Sum of Squared Error decreases as we increase
the number of Clusters. The Elbow is a point from where there is no significant
reduction in the Sum of the Squared Errors, even if the number of clusters increases
further.
Using the Yellow Brick Library, you can use the following Python code to calculate
the Elbow method:
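The code itself is not reproduced in these notes; a minimal sketch using Yellowbrick's KElbowVisualizer (assuming X holds the data to be clustered, with an illustrative range of k values) would be:
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Fit K-means for k = 2..10, plot the distortion (SSE) against k, and mark the elbow
visualizer = KElbowVisualizer(KMeans(), k=(2, 10))
visualizer.fit(X)
visualizer.show()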
The Python code above imports the “KElbowVisualizer” class from the
“yellowbrick.cluster” module and uses it to plot the distortion score for each value of k.
Yellowbrick is a machine learning visualization library in Python that extends
the scikit-learn API with visual analysis and diagnostic tools.
Silhouette Method
The Silhouette method is different from the Elbow method. It calculates the
Silhouette coefficient of every point. A Silhouette Coefficient, or the Silhouette
Score, is a metric to calculate the efficiency of a clustering technique, ranging from
the values -1 to 1.
The computation of the Silhouette Coefficient takes place for each data point by first
determining the average distance from the point to the other points in its cluster,
denoted as ‘a’. It then measures the average distance from the point to the points in
the nearest cluster that it is not a part of, denoted as ‘b’. The Silhouette Coefficient
is calculated using these average distances.
S = (b - a) / max(a, b)
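In scikit-learn, the mean Silhouette Coefficient over all points can be computed directly; a minimal sketch (assuming X holds the data being clustered, with an illustrative range of k values):
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Compute the average silhouette score for several candidate values of k;
# the k with the highest score is preferred.
for k in range(2, 6):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(X)
    print(k, silhouette_score(X, labels))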
In this example, we will implement a K-means Clustering algorithm using Python and
Scikit-learn library to visualize a simple K-means clustering.
1. Import Libraries
import numpy as np
# data visualisation
import matplotlib.pyplot as plt
# Scikit-learn library
from sklearn.cluster import KMeans
%matplotlib inline
2. Generate Random Data
X = -2 * np.random.rand(150, 2)
X1 = 1 + 2 * np.random.rand(50, 2)
X[50:100, :] = X1
3. Scikit-Learn
I will use the tools available in the Scikit-Learn library to process the randomly
generated data into clusters. This is where I am selecting my number of clusters. For
this example, I will be using two clusters:
# number of clusters, k = 2
Kmean = KMeans(n_clusters=2)
Kmean.fit(X)
After selecting the number of clusters 'k' and fitting the model, our next step is to find the
centroids. To do this, we run this code:
Kmean.cluster_centers_
The output is an array with the coordinates of the two centroids (roughly (-1, -1) and (2, 2) for this randomly generated data).
You can visualize the assignment of the two centroids, one green and the other orange, with a short plotting sketch (the colours and marker sizes are illustrative choices):
plt.scatter(X[:, 0], X[:, 1], s=50, c='b')
plt.scatter(Kmean.cluster_centers_[0][0], Kmean.cluster_centers_[0][1], s=200, c='g', marker='s')
plt.scatter(Kmean.cluster_centers_[1][0], Kmean.cluster_centers_[1][1], s=200, c='orange', marker='s')
plt.show()
If you would like to test the algorithm or see which data points have been assigned
to which cluster, you can do so with the following code:
Kmean.labels_
Output:
array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)
6. Testing
If you wish to test your K-means cluster, generate a sample, reshape it, and use
“Kmean.predict”. This will assign the new data point to a centroid. Please check
the following code:
sample_test=np.array([-2.0,-2.0])
second_test=sample_test.reshape(1, -1)
Kmean.predict(second_test)
Output:
array([1], dtype=int32)
Advantages:
K-means is simple to implement, computationally efficient, and scales well to large datasets.
Disadvantages:
The number of clusters k must be chosen in advance, the results depend on the initial placement of the centroids, and the algorithm assumes roughly spherical clusters of similar size and density, making it sensitive to outliers.
Applications of K-Means
Market Segmentation:
Companies use K-means to partition customers into groups based on similar attributes
like buying habits, demographics, and psychographics. This allows for targeted
marketing strategies, personalized customer engagement, and efficient allocation of
marketing resources.
Document Clustering:
In text mining, K-means can be used to group similar documents, which helps
organize large volumes of text data. This grouping improves information retrieval
systems. Moreover, it helps in the identification of thematic patterns across
documents.
Image Segmentation:
K-means is used in computer vision to segment digital images into distinct regions.
This is useful for object recognition, image compression, and enhancing the
effectiveness of other image processing tasks.
Anomaly Detection:
By clustering data into groups, K-means can help identify data points that do not
belong to any cluster or are significantly far from their centroids, which could be
potential anomalies or outliers.
Feature Learning:
It can be used for feature engineering by identifying prominent clusters within feature
sets, which can then be used to create new features that help improve the performance
of predictive models.
Pattern Recognition:
Astronomy:
Fraud Detection:
Insurers utilize clustering to find patterns and groupings in claims data that may
indicate fraudulent activity, enabling them to focus investigative resources more
effectively.
Healthcare:
In healthcare, K-means can cluster patient data to identify commonalities in medical
conditions, leading to more personalized treatment plans and a better understanding
of patient needs.
Operational Efficiency:
Urban Planning:
K-means can be applied to cluster houses, roads, and areas to optimize the planning
of utilities and services, like hospitals, schools, and emergency services.
Social Network Analysis:
Clustering users based on their connections and interactions can reveal communities
within social networks, influential users, and the flow of information.
Conclusion
The above applications leverage K-means’ ability to make sense of unlabeled data by
finding natural groupings within it.
However, the success of K-means in these applications depends on the nature of the
data and the careful selection of the number of clusters k. Other clustering techniques
may be more appropriate in cases where the assumptions of K-means do not hold,
such as non-spherical clusters or clusters of different sizes and densities.
4.10 INSTANCE BASED LEARNING
What is Instance-based Learning?
Instance-based Learning is a family of machine learning methods that store the training instances themselves and generalize to new instances by comparing them to the stored ones with a similarity measure, rather than learning an explicit general model.
History
Instance-based Learning algorithms were developed in the 1980s as a part of machine learning
research. The most widely recognized algorithm within this family is the K-Nearest Neighbor
(K-NN) Algorithm. These algorithms have seen extensive use in various data mining
applications and pattern recognition.
Instance-based Learning models employ an approach where the instances themselves are used during the
learning process. The core features of these models include the following:
• The training instances are stored in memory rather than being summarized by an explicit model.
• Generalization is delayed until a new query instance must be classified (lazy learning).
• Predictions are made by comparing the new instance to the stored instances using a similarity or distance measure.
Security Aspects
While Instance-based Learning itself does not have inherent security measures, it is critical to
ensure data privacy and security when using this method, especially in a data lakehouse setup.
It's vital to encrypt sensitive data and use access controls to protect the integrity and privacy of
the data.
Performance
Instance-based and model-based learning are two different approaches in machine learning for
solving tasks and making predictions. Let's delve into each of these approaches:
1. Instance-Based Learning:
Instance-based learning, also known as lazy learning, is a type of machine learning where the
model doesn't explicitly learn a general representation of the underlying data distribution.
Instead, it stores the training instances themselves and uses them to make predictions on new,
unseen instances. The key idea is to compare the new instance to the stored instances and find
the most similar ones to make a prediction.
The primary algorithm in instance-based learning is k-nearest neighbors (k-NN). In k-NN,
when given a new data point, the algorithm identifies the k-closest training instances
(neighbors) and uses their labels to determine the label of the new instance. The choice of k
determines the level of smoothness or noise resistance of the predictions.
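A minimal scikit-learn sketch of k-NN (the synthetic dataset and the value of k are illustrative choices):
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, random_state=0)
# The model simply stores the training instances; a prediction compares a new point
# against its k = 5 nearest stored neighbours and takes a majority vote of their labels.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print(knn.predict(X[:3]))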
Some of the instance-based learning algorithms are:
• K-Nearest Neighbor (K-NN)
• Locally Weighted Regression
• Case-Based Reasoning
The machine learning systems which are categorized as instance-based learning are systems
that learn the training examples by heart and then generalize to new instances based
on some similarity measure. It is called instance-based because it builds its hypotheses from
the training instances themselves. It is also known as memory-based learning or lazy
learning (because processing is delayed until a new instance must be classified). The time
complexity of this algorithm depends on the size of the training data. Each time a
new query is encountered, the previously stored data is examined and a target
function value is assigned to the new instance.
The worst-case time complexity of this algorithm is O (n), where n is the number of training
instances. For example, If we were to create a spam filter with an instance-based learning
algorithm, instead of just flagging emails that are already marked as spam emails, our spam
filter would be programmed to also flag emails that are very similar to them. This requires a
measure of resemblance between two emails. A similarity measure between two emails could
be the same sender or the repetitive use of the same keywords or something else.
Flexibility: It can adapt quickly to changes as it does not rely on a previously built model.
Ease of implementation: The algorithm is easy to implement and understand.
No Training Phase: It does not require a training phase as each instance represents itself.
In a data lakehouse environment, Instance-based Learning can be used to analyze and classify
the vast, diverse data. It can help in creating segmentation and classification rules for better
data organization and retrieval. Further, with the scale-out architecture of a data lakehouse, vast
amounts of instance data can be stored and processed efficiently, overcoming one of the major
limitations of Instance-based Learning.
Advantages:
1. Instead of estimating for the entire instance set, local approximations can be made to the
target function.
2. This algorithm can adapt to new data easily, one which is collected as we go.
3. No explicit model training is required, so it's easy to adapt to new data.
4. Can handle complex relationships and adapt well to varying data distributions.
Disadvantages:
1. Classification costs are high
2. Large amount of memory required to store the data, and each query involves starting the
identification of a local model from scratch.
3. Can be computationally expensive, especially with large datasets.
4. Sensitive to irrelevant or noisy features in the data.
5. Lack of a learned model makes it harder to interpret or explain predictions.
4.12 GAUSSIAN MIXTURE MODELS AND EXPECTATION MAXIMIZATION
GMM algorithm
The Gaussian mixture model (GMM) is a probabilistic model that assumes the data points
come from a finite set of Gaussian distributions with unknown parameters. Each individual
Gaussian distribution is characterized by its mean and covariance matrix.
As an extension of the k-means clustering technique, a GMM takes into account the data’s
covariance structure and the likelihood of each point being derived from each Gaussian
distribution.
The Gaussian Mixture Model or GMM is defined as a mixture model that combines several
probability distributions whose parameters are unspecified. Fitting a GMM therefore requires
estimating statistics such as the mean and standard deviation (the parameters) of each component.
It is used to estimate the parameters of the probability distributions that best fit the density
of a given training dataset.
Although there are plenty of techniques available to estimate the parameter of the Gaussian
Mixture Model (GMM), the Maximum Likelihood Estimation is one of the most popular
techniques among them.
Each data point is given a probability of belonging to each cluster, which makes GMM a soft
clustering technique. This provides more leeway and can accommodate scenarios where
data points do not naturally fall into a single cluster.
Let's understand a case where we have a dataset with multiple data points generated by two
different processes. However, both processes have similar Gaussian probability
distributions, and the data are combined. Hence it is very difficult to determine which
distribution a given point belongs to.
The processes used to generate the data points represent a latent variable, or unobservable data.
In such cases, the Expectation-Maximization (EM) algorithm is one of the best techniques to
estimate the parameters of the Gaussian distributions. In the EM algorithm, the E-step
estimates the expected value of each latent variable, whereas the M-step optimizes the parameters
using Maximum Likelihood Estimation (MLE). This process is
repeated until a good set of latent values and a maximum-likelihood fit to the
data are achieved.
The GMM is trained using the EM algorithm, an iterative approach for determining the most
likely estimations of the mixture Gaussian distribution parameters. The EM method first makes
rough guesses at the parameters, then repeatedly improves those guesses until convergence is
reached.
The GaussianMixture class from the Scikit-learn toolkit makes it possible to implement the
Gaussian mixture model in Python. It offers many choices for configuring the algorithm’s
initialization, covariance type, and other settings, and it’s quite easy to use.
Expectation phase: Determine the likelihood that each data point was created using each of the
Gaussian distributions.
Maximization phase: Apply the probabilities found in the expectation step to re-estimate the
Gaussian distribution parameters.
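A minimal sketch of these two phases using scikit-learn's GaussianMixture class (the number of components and the toy data are illustrative choices):
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy data drawn from two Gaussian-like blobs
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(100, 2)), rng.normal(5, 1, size=(100, 2))])

# fit() runs the EM iterations (expectation and maximization phases) internally
gmm = GaussianMixture(n_components=2, covariance_type='full', random_state=0).fit(X)
print(gmm.means_)                # estimated component means
print(gmm.predict_proba(X[:3]))  # soft (probabilistic) cluster memberships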
GMM equation
The Gaussian mixture model equation defines the probability density function of a multivariate
Gaussian mixture. Pdf is a mathematical function that characterizes the likelihood that a given
data point, x, belongs to a certain cluster or component, k.
The pdf for a GMM with K clusters is given by the equation:
p(x) = Σ_{k=1}^{K} π_k N(x | μ_k, Σ_k)
Where:
π_k is the mixing coefficient.
The mixing coefficients π_k are non-negative and sum to 1. They represent the proportion of
the data points that belong to cluster k.
The Gaussian distributions, represented by N(x|μ_k, Σ_k), are the likelihoods of the data points
x given the cluster parameters. Each cluster k has its own mean vector μ_k and covariance
matrix Σ_k.
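As an illustration, the mixture density above can be evaluated directly from these quantities; the parameter values in the sketch below are made up purely for illustration:
import numpy as np
from scipy.stats import multivariate_normal

pi = np.array([0.6, 0.4])                            # mixing coefficients, sum to 1
mu = [np.array([0.0, 0.0]), np.array([3.0, 3.0])]    # component means
sigma = [np.eye(2), 2 * np.eye(2)]                   # component covariance matrices

x = np.array([1.0, 1.0])
# p(x) = sum_k pi_k * N(x | mu_k, Sigma_k)
p_x = sum(pi[k] * multivariate_normal(mu[k], sigma[k]).pdf(x) for k in range(2))
print(p_x)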
Finding the most likely estimates of the parameters of the Gaussian distributions requires
solving the GMM equation, which is used in the Expectation-Maximization (EM) process. The
EM method first makes rough guesses at the parameters, then repeatedly improves those guesses
until convergence is reached.
Application of GMM
Density estimation and clustering both benefit greatly from the use of GMM. Useful for both
data generation and imputation, GMM’s generative model capability makes it a powerful tool in a
variety of contexts.
Additionally, GMM may be used to model data with several modes, where each mode is
represented by a Gaussian distribution.
Recently, GMM has been put to use in the feature extraction of voice data for use in speech
recognition systems. They have also seen widespread application in multi-object tracking,
where the number of components and respective means are used to forecast object placements
at each frame of a video sequence. To enable object tracking over time, the EM method is
employed to update the component means between video frames.
EM Algorithm in Machine Learning
The EM algorithm is a method for finding local maximum likelihood estimates of the parameters
of a statistical model with latent variables; it was proposed by Arthur Dempster, Nan Laird, and
Donald Rubin in 1977. The EM (Expectation-Maximization) algorithm is one of the most
commonly used techniques in machine learning for obtaining maximum likelihood estimates of variables
that are sometimes observable and sometimes not; it is equally applicable to unobserved
data, also called latent data. It has various real-world applications in statistics, including
obtaining the mode of the posterior marginal distribution of parameters in machine
learning and data mining applications.
In most real-life applications of machine learning, it is found that several relevant learning
features are available, but very few of them are observable while the rest are unobservable. If
a variable is observable, then its value can be predicted directly from the training instances. For
variables which are latent, i.e., not directly observable, the Expectation-Maximization (EM)
algorithm plays a vital role in predicting their values, with the condition that the
general form of the probability distribution governing those latent variables is known to us. In this
topic, we will discuss a basic introduction to the EM algorithm, a flow chart of the EM
algorithm, its applications, and the advantages and disadvantages of the EM algorithm.
What is an EM algorithm?
A latent variable model consists of both observable and unobservable variables where
observable can be predicted while unobserved are inferred from the observed variable. These
unobservable variables are known as latent variables.
Key Points:
o It is applied to latent variable models to determine the maximum likelihood (MLE) and
maximum a posteriori (MAP) estimates of the parameters.
o It is used to predict values of parameters in instances where data is missing or
unobservable for learning, and this is done until convergence of the values occurs.
EM Algorithm
o Expectation step (E - step): It involves the estimation (guess) of all missing values in
the dataset so that after completing this step, there should not be any missing value.
o Maximization step (M - step): This step involves the use of estimated data in the E-
step and updating the parameters.
o Repeat E-step and M-step until the convergence of the values occurs.
The primary goal of the EM algorithm is to use the available observed data of the dataset to
estimate the missing data of the latent variables and then use that data to update the values of
the parameters in the M-step.
What is Convergence in the EM algorithm?
Convergence is defined intuitively in terms of probability: if two random variables have a very
small difference in their probability, then they are said to have converged. In other words,
whenever the values of the given variables match with each other across iterations, it is called
convergence.
Steps in EM Algorithm
o 1st Step: The very first step is to initialize the parameter values. Further, the system is
provided with incomplete observed data with the assumption that data is obtained from
a specific model.
o 2nd Step: This step is known as Expectation or E-Step, which is used to estimate or
guess the values of the missing or incomplete data using the observed data. Further, E-
step primarily updates the variables.
o 3rd Step: This step is known as Maximization or M-step, where we use complete data
obtained from the 2nd step to update the parameter values. Further, M-step primarily
updates the hypothesis.
o 4th step: The last step is to check if the values of latent variables are converging or not.
If it gets "yes", then stop the process; else, repeat the process from step 2 until the
convergence occurs.
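A minimal NumPy sketch of these steps for a one-dimensional, two-component Gaussian mixture (the toy data and initial parameter values are illustrative, and the convergence check of the 4th step is simplified to a fixed number of iterations):
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 100), rng.normal(5, 1, 100)])   # toy observed data

# Step 1: initialize the parameters
pi, mu, sigma = np.array([0.5, 0.5]), np.array([-1.0, 1.0]), np.array([1.0, 1.0])

for _ in range(50):
    # Step 2 (E-step): responsibility of each component for each data point
    dens = np.vstack([pi[k] * norm.pdf(x, mu[k], sigma[k]) for k in range(2)])
    resp = dens / dens.sum(axis=0)
    # Step 3 (M-step): update the parameters from the responsibilities
    Nk = resp.sum(axis=1)
    pi = Nk / len(x)
    mu = (resp * x).sum(axis=1) / Nk
    sigma = np.sqrt((resp * (x - mu[:, None]) ** 2).sum(axis=1) / Nk)

print(pi, mu, sigma)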
Applications of EM algorithm
The primary aim of the EM algorithm is to estimate the missing data in the latent variables
through observed data in datasets. The EM algorithm or latent variable model has a broad range
of real-life applications in machine learning. These are as follows:
o It can be used to fill in missing data in a dataset.
o It can be used as the basis of unsupervised learning of clusters, for example to estimate the
parameters of a Gaussian Mixture Model.
o It can be used for estimating the parameters of the Hidden Markov Model (HMM).
o It is also used in psychometrics for estimating item parameters and latent abilities of
item response theory models.
Advantages of EM algorithm
o The two basic steps of the EM algorithm, the E-step and the M-step, are very easy to
implement for many machine learning problems.
o The likelihood is guaranteed to increase (or at least not decrease) after each iteration.
o It often yields a closed-form solution for the M-step.
Disadvantages of EM algorithm