
UNSUPERVISED LEARNING
Module 6.1
WHAT IS UNSUPERVISED LEARNING?
• Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets.
• These algorithms discover hidden patterns or data groupings without the need for human intervention.
• Its ability to discover similarities and differences in information makes it an ideal approach for exploratory data analysis, cross-selling strategies, customer segmentation, and image recognition.
• In supervised machine learning, models are trained using labeled data, under the supervision of the training outputs.
• In many cases, however, we do not have labeled data and still need to find the hidden patterns in the given dataset.
• To solve such cases in machine learning, we need unsupervised learning techniques.
• Unsupervised learning is a machine learning technique in which models are not supervised using a training dataset.


• Instead, the model itself finds hidden patterns and insights in the given data.
• It can be compared to the learning that takes place in the human brain when learning new things. It can be defined as:
“Unsupervised learning is a type of machine learning in which models are trained using an unlabeled
dataset and are allowed to act on that data without any supervision.”
• Unsupervised learning cannot be directly applied to a regression or classification problem because, unlike supervised learning, we have the input data but no corresponding output data.
• The goal of unsupervised learning is to find the underlying structure of the dataset, group the data according to similarities, and represent the dataset in a compressed format.
• Example: Suppose an unsupervised learning algorithm is given an input dataset containing images of different types of cats and dogs.
• The algorithm is never trained on the given dataset, which means it has no prior knowledge of the dataset's features.
• The task of the unsupervised learning algorithm is to identify the image features on its own. It performs this task by clustering the image dataset into groups according to the similarities between images.

Below are some of the main reasons why unsupervised learning is important:

• Unsupervised learning is helpful for finding useful insights in data.
• Unsupervised learning closely resembles how a human learns to think through their own experiences, which brings it closer to real AI.
• Unsupervised learning works on unlabeled and uncategorized data, which makes it all the more important.
• In the real world, we do not always have input data with corresponding outputs, so solving such cases requires unsupervised learning.

WORKING
• Here, we take unlabeled input data, meaning it is not categorized and no corresponding outputs are given.
• This unlabeled input data is fed to the machine learning model in order to train it.
• First, the model interprets the raw data to find hidden patterns, and then a suitable algorithm, such as k-means clustering or hierarchical clustering, is applied.
• Once the suitable algorithm is applied, it divides the data objects into groups according to the similarities and differences between the objects.
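The grouping step described above can be sketched in a few lines of plain Python. This is a hypothetical illustration: the 2-D data points and the two seed centroids are invented, and only a single assignment pass is shown.

```python
# A minimal sketch of the workflow described above: unlabeled points are fed
# to a routine that groups them purely by similarity (distance to a centroid).
# The data points and the two seed centroids below are hypothetical.

def assign_to_nearest(points, centroids):
    """Assign each point to the index of its nearest centroid (squared Euclidean)."""
    groups = {i: [] for i in range(len(centroids))}
    for p in points:
        distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
        groups[distances.index(min(distances))].append(p)
    return groups

# Unlabeled input data: no categories, no outputs, just raw feature vectors.
data = [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)]
seeds = [(1.0, 1.0), (8.0, 8.0)]   # two provisional cluster centres

groups = assign_to_nearest(data, seeds)
print(groups[0])  # points similar to the first seed
print(groups[1])  # points similar to the second seed
```

The two printed groups correspond to the two natural clumps in the data, even though no labels were ever provided.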

ADVANTAGES OF UNSUPERVISED LEARNING
• Unsupervised learning is used for more complex tasks than supervised learning because, in unsupervised learning, we don't have labeled input data.
• Unsupervised learning is often preferable because unlabeled data is much easier to obtain than labeled data.
• It does not require the training data to be labeled.
• Dimensionality reduction can be easily accomplished using unsupervised learning.
• It is capable of finding previously unknown patterns in data.
• Flexibility: Unsupervised learning is flexible in that it can be applied to a wide variety of problems,
including clustering, anomaly detection, and association rule mining.
• Exploration: Unsupervised learning allows for the exploration of data and the discovery of novel and
potentially useful patterns that may not be apparent from the outset.
• Low cost: Unsupervised learning is often less expensive than supervised learning because it doesn’t
require labeled data, which can be time-consuming and costly to obtain.

DISADVANTAGES OF UNSUPERVISED LEARNING
• Unsupervised learning is intrinsically more difficult than supervised learning because there are no corresponding outputs.
• The results of an unsupervised learning algorithm may be less accurate because the input data is not labeled and the algorithm does not know the expected output in advance.
• It is difficult to measure accuracy or effectiveness due to the lack of predefined answers during training.
• The results often have lower accuracy.
• The user needs to spend time interpreting and labeling the classes that result from the grouping.
• Lack of guidance: Unsupervised learning lacks the guidance and feedback provided by labeled data,
which can make it difficult to know whether the discovered patterns are relevant or useful.
• Sensitivity to data quality: Unsupervised learning can be sensitive to data quality, including missing
values, outliers, and noisy data.
• Scalability: Unsupervised learning can be computationally expensive, particularly for large datasets or
complex algorithms, which can limit its scalability.

UNSUPERVISED LEARNING
APPROACHES
Unsupervised learning models are utilized for three main tasks: clustering, association, and dimensionality reduction.

CLUSTERING
• Clustering is a data mining technique that groups unlabeled data based on similarities or differences.
• Clustering algorithms are used to process raw, unclassified data objects into groups represented by structures or patterns in the information.
• Clustering algorithms can be categorized into a few types: exclusive, overlapping, hierarchical, and probabilistic.

ASSOCIATION RULES
• An association rule is a rule-based method for finding relationships between variables in a given
dataset.
• These methods are frequently used for market basket analysis, allowing companies to better understand
relationships between different products.
• Understanding consumption habits of customers enables businesses to develop better cross-selling
strategies and recommendation engines.
ASSOCIATION
RULES

ASSOCIATION RULES
• Association rule learning is a type of unsupervised learning technique that checks for the dependency of one data item on another and maps these dependencies so that they can be exploited profitably.
• It tries to find interesting relations or associations among the variables of a dataset.
• It is based on different rules for discovering interesting relations between variables in a database.
• Association rule learning is one of the important concepts of machine learning, and it is employed in market basket analysis, web usage mining, continuous production, etc.
• Market basket analysis is a technique used by large retailers to discover associations between items.
• In market basket analysis, association rules are used to predict the likelihood of products being purchased together.
• Association rules count the frequency of items that occur together, seeking to find associations that occur far more often than expected.
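The frequency counting described above can be made concrete with two standard measures: the support of an itemset (the fraction of transactions containing it) and the confidence of a rule (how often the consequent appears given the antecedent). The four grocery transactions and the rule {bread} -> {butter} below are hypothetical.

```python
# Hypothetical market-basket transactions (each basket is a set of items).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "eggs"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every item in the itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule {bread} -> {butter}: how likely is butter, given that bread was bought?
sup_bread = support({"bread"}, transactions)           # 3/4 = 0.75
sup_both = support({"bread", "butter"}, transactions)  # 2/4 = 0.5
confidence = sup_both / sup_bread                      # 0.5 / 0.75 ≈ 0.667
print(sup_both, round(confidence, 3))
```

A rule is considered interesting when its support and confidence exceed user-chosen thresholds; here butter appears in about two-thirds of the baskets that contain bread.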

MARKET BASKET ANALYSIS

APRIORI
ALGORITHM

APRIORI ALGORITHM
• The Apriori algorithm is used to find frequent itemsets in a dataset for Boolean association rules.
• The algorithm is named Apriori because it uses prior knowledge of frequent-itemset properties.
• We apply an iterative, level-wise search in which frequent k-itemsets are used to find (k+1)-itemsets.
• To improve the efficiency of the level-wise generation of frequent itemsets, an important property called the Apriori property is used, which reduces the search space.
• All non-empty subsets of a frequent itemset must themselves be frequent.
• The key concept of the Apriori algorithm is the anti-monotonicity of the support measure.
• Apriori assumes that “all subsets of a frequent itemset must be frequent (the Apriori property); if an itemset is infrequent, all its supersets will be infrequent.”
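The level-wise search and the pruning step can be sketched compactly in pure Python. This is an illustrative sketch, not an optimized implementation; the five transactions and the minimum support of 0.6 are hypothetical.

```python
# A compact sketch of the Apriori level-wise search: frequent k-itemsets are
# joined into (k+1)-candidates, and the Apriori property prunes any candidate
# that has an infrequent k-subset.
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) mapped to its support."""
    n = len(transactions)
    items = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    # Level 1: frequent single items.
    level = {s for s in items
             if sum(s <= t for t in transactions) / n >= min_support}
    k = 1
    while level:
        for s in level:
            frequent[s] = sum(s <= t for t in transactions) / n
        # Join step: combine frequent k-itemsets into (k+1)-candidates.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune step (Apriori property): every k-subset must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(sub) in level for sub in combinations(c, k))}
        # Keep only candidates that meet the support threshold.
        level = {c for c in candidates
                 if sum(c <= t for t in transactions) / n >= min_support}
        k += 1
    return frequent

baskets = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]
result = apriori(baskets, min_support=0.6)
print(sorted(tuple(sorted(s)) for s in result))
```

On these baskets every single item and every pair is frequent at the 0.6 threshold, but the triple {a, b, c} appears in only 2 of 5 baskets and is correctly rejected.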

CLUSTER
ANALYSIS

CLUSTER ANALYSIS
• Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar to each other than to those in other groups (clusters).
• This process is often used for exploratory data analysis and can help identify patterns or relationships within the data that may not be immediately obvious.
• There are many different algorithms used for cluster analysis, such as k-means, hierarchical clustering, and density-based clustering.
• The choice of algorithm depends on the specific requirements of the analysis and the nature of the data being analyzed.
• Cluster analysis is the process of finding similar groups of objects in order to form clusters.
• It is an unsupervised machine-learning technique that acts on unlabeled data.
• The given data is divided into different groups by combining similar objects into a group; each such group is a cluster, i.e., a collection of similar data grouped together.
• For example, consider a dataset of vehicles containing information about different vehicles such as cars, buses, bicycles, etc.
• Because this is unsupervised learning, there are no class labels like "car" or "bike" for the vehicles; all the data is mixed together and is not structured.

PROPERTIES OF CLUSTERING
1. Scalability: Nowadays there are vast amounts of data, so clustering often has to deal with huge databases. To handle such databases, the clustering algorithm must be scalable; if it is not, results on large datasets may be inappropriate or wrong.
2. High dimensionality: The algorithm should be able to handle high-dimensional spaces as well as small datasets.
3. Usability with multiple data kinds: Different kinds of data can be used with clustering algorithms. They should be capable of dealing with different types of data, such as discrete, categorical, interval-based, and binary data.
4. Dealing with unstructured data: Some databases contain missing values and noisy or erroneous data. If an algorithm is sensitive to such data, it may produce poor-quality clusters. A clustering algorithm should therefore be able to handle such data and give it structure by organizing it into groups of similar data objects. This makes it easier for the data expert to process the data and discover new patterns.
5. Interpretability: The clustering outcomes should be interpretable, comprehensible, and usable. Interpretability reflects how easily the results can be understood.

APPLICATIONS OF CLUSTER ANALYSIS
• It is widely used in image processing, data analysis, and pattern recognition.
• It helps marketers find distinct groups in their customer base and characterize those groups by their purchasing patterns.
• It can be used in the field of biology, by deriving animal and plant taxonomies and identifying genes
with the same capabilities.
• It also helps in information discovery by classifying documents on the web.

ADVANTAGES OF CLUSTER ANALYSIS
1. It can help identify patterns and relationships within a dataset that may not be
immediately obvious.
2. It can be used for exploratory data analysis and can help with feature selection.
3. It can be used to reduce the dimensionality of the data.
4. It can be used for anomaly detection and outlier identification.
5. It can be used for market segmentation and customer profiling.

DISADVANTAGES OF CLUSTER ANALYSIS
1. It can be sensitive to the choice of initial conditions and the number of clusters.
2. It can be sensitive to the presence of noise or outliers in the data.
3. It can be difficult to interpret the results of the analysis if the clusters are not well-
defined.
4. It can be computationally expensive for large datasets.
5. The results of the analysis can be affected by the choice of clustering algorithm
used.
6. It is important to note that the success of cluster analysis depends on the data, the
goals of the analysis, and the ability of the analyst to interpret the results.

K MEANS
CLUSTERING

K MEANS
• K-Means Clustering is an unsupervised learning algorithm that groups an unlabeled dataset into different clusters.
• Here K defines the number of pre-defined clusters to be created in the process; if K=2 there will be two clusters, for K=3 there will be three clusters, and so on.
• It is an iterative algorithm that divides the unlabeled dataset into k different clusters in such a way that each data point belongs to only one group of points with similar properties.
• It allows us to cluster the data into different groups and provides a convenient way to discover the categories present in an unlabeled dataset, without any training.
• It is a centroid-based algorithm, where each cluster is associated with a centroid.
• The main aim of this algorithm is to minimize the sum of distances between the data points and their corresponding cluster centroids.
• The algorithm takes the unlabeled dataset as input, divides it into k clusters, and repeats the process until no better clustering can be found.
• The value of k must be predetermined in this algorithm.
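The "sum of distances" objective mentioned above is commonly written as the within-cluster sum of squares, where \(C_j\) is the j-th cluster and \(\mu_j\) its centroid:

```latex
J = \sum_{j=1}^{K} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^{2}
```

Each iteration of k-means lowers (or leaves unchanged) this value: reassigning points to their nearest centroid cannot increase J, and recomputing each centroid as the mean of its cluster minimizes J for the current assignments.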

WORKING OF K MEANS
The working of the K-Means algorithm is explained in the steps below:
• Step-1: Select the number K to decide the number of clusters.
• Step-2: Select K random points as initial centroids (they need not come from the input dataset).
• Step-3: Assign each data point to its closest centroid, which forms the predefined K clusters.
• Step-4: Compute a new centroid for each cluster (the mean of the points assigned to it).
• Step-5: Repeat Step-3, i.e., reassign each data point to the new closest centroid.
• Step-6: If any reassignment occurred, go to Step-4; otherwise go to FINISH.
• Step-7: The model is ready.
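The steps above can be sketched from scratch in plain Python. This is a minimal teaching sketch, not a production implementation: the four 2-D points and K=2 are hypothetical, and convergence is detected by the centroids no longer moving (which is when no reassignment can occur).

```python
# From-scratch k-means following the listed steps.
import random

def kmeans(points, k, seed=0):
    random.seed(seed)
    # Steps 1-2: choose K and pick K random data points as initial centroids.
    centroids = random.sample(points, k)
    while True:
        # Steps 3/5: assign each point to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[d.index(min(d))].append(p)
        # Step 4: recompute each centroid as the mean of its cluster
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        # Step 6: stop once no centroid moves; otherwise iterate again.
        if new_centroids == centroids:
            return centroids, clusters
        centroids = new_centroids

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 8.5)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))  # [(1.25, 1.5), (8.5, 8.25)]
```

On this toy data every initialization converges to the same two centroids, one per visible clump; on harder data, k-means can converge to different local optima depending on the random seeds.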

GAUSSIAN
MIXTURES

VECTOR
QUANTIZATION

VECTOR QUANTIZATION
• Vector quantization is a technique in signal processing and data compression that maps high-dimensional data points (vectors) onto a set of representative vectors, called code vectors or codewords, which form a codebook.
• The process involves dividing the vector space into a finite number of regions and associating each region with a codeword.
• The goal of vector quantization is to reduce the amount of data required to represent a signal or image while maintaining a reasonable level of fidelity or quality.
• In other words, it is a lossy compression technique that trades off the size of the codebook (which affects the quality of the reconstructed signal) against the compression ratio (which affects the amount of compression achieved).
• Vector quantization has applications in a wide range of fields, including speech and audio coding, image and video compression, pattern recognition, and data mining.
• It is often used in conjunction with other compression techniques, such as transform coding and entropy coding, to achieve even greater compression ratios.
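The codebook mapping described above can be sketched directly: encoding replaces each vector by the index of its nearest codeword, and decoding looks the index back up. The 2-entry codebook and the 4-sample "signal" below are hypothetical; in practice the codebook would be learned from data (e.g. with k-means, as in the Linde-Buzo-Gray algorithm).

```python
# Hypothetical codebook of two representative code vectors.
codebook = [(0.0, 0.0), (10.0, 10.0)]

def encode(vectors, codebook):
    """Quantize each vector to the index of its nearest codeword (lossy)."""
    def nearest(v):
        d = [sum((a - b) ** 2 for a, b in zip(v, c)) for c in codebook]
        return d.index(min(d))
    return [nearest(v) for v in vectors]

def decode(indices, codebook):
    """Reconstruct (approximately) by looking each index up in the codebook."""
    return [codebook[i] for i in indices]

signal = [(0.2, -0.1), (9.8, 10.3), (0.4, 0.4), (10.1, 9.9)]
codes = encode(signal, codebook)   # [0, 1, 0, 1] - one small index per vector
print(codes)
print(decode(codes, codebook))    # lossy reconstruction from the codebook
```

Instead of storing two floats per sample, only a codeword index is stored (here 1 bit per vector), which is exactly the codebook-size versus compression-ratio trade-off described above.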

K MEDOIDS
CLUSTERING

HIERARCHICAL
CLUSTERING

HIERARCHICAL CLUSTERING
• Hierarchical clustering is a clustering algorithm used in machine learning and data analysis to group similar data points or objects into clusters based on their pairwise distances or similarities.
• In its most common form it is a bottom-up approach that starts by treating each data point as a separate cluster and then iteratively merges the most similar pairs of clusters until a single cluster is formed.
• The resulting hierarchy of clusters can be represented as a dendrogram, a tree-like diagram that shows the relationships between the clusters at different levels of the hierarchy.
• The height of each branch in the dendrogram represents the distance or dissimilarity between the clusters being merged.
• There are two main types of hierarchical clustering: agglomerative and divisive.
• Agglomerative clustering is the more common approach; it starts with individual data points as clusters and iteratively merges them to form larger clusters.
• Divisive clustering, on the other hand, starts with a single cluster and recursively divides it into smaller clusters.
• One advantage of hierarchical clustering is that it does not require the number of clusters to be
specified beforehand, as the hierarchy can be cut at different levels to obtain different numbers of
clusters.
• Hierarchical clustering is commonly used in a variety of applications, such as image segmentation,
customer segmentation, and gene expression analysis.
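The agglomerative (bottom-up) procedure described above can be sketched in pure Python using single linkage, where the distance between two clusters is the smallest pairwise distance between their members. The 1-D points and the target of 3 clusters are hypothetical; stopping at a target count corresponds to cutting the dendrogram at one level.

```python
# A small sketch of agglomerative clustering with single linkage:
# every point starts as its own cluster, and the closest pair of clusters
# is merged repeatedly until the desired number of clusters remains.

def single_linkage_distance(c1, c2):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(a - b) for a in c1 for b in c2)

def agglomerate(points, n_clusters):
    clusters = [[p] for p in points]          # every point starts alone
    while len(clusters) > n_clusters:
        # Find the closest pair of clusters...
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: single_linkage_distance(clusters[ij[0]], clusters[ij[1]]),
        )
        # ...and merge them (each merge distance would be a dendrogram height).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

print(agglomerate([1.0, 1.2, 5.0, 5.1, 9.0], 3))  # [[1.0, 1.2], [5.0, 5.1], [9.0]]
```

Because the full merge history forms a hierarchy, choosing a different `n_clusters` simply cuts the same dendrogram at a different height, which is the advantage noted above of not fixing the number of clusters in advance.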

SELF
ORGANIZING
MAPS

PRINCIPAL
COMPONENT
ANALYSIS
(PCA)

THANK YOU
