
Machine Learning:

Session 1:

Session 2:
Segmentation and Clustering. Segmenting is the process of putting customers
into groups based on similarities, and clustering is the process of finding
similarities in customers so that they can be grouped, and therefore
segmented. Two different groups, however, should be dissimilar from each other.
Companies might try to identify similarities based on demographics,
psychographics, or product expectations.

Data analysts must decide what similarity means in their context: we need to
know on what features or characteristics we judge the similarity.

For quantitative attributes we define similarity using Euclidean distance:


Euclidean Distance:
Euclidean distance is considered the traditional metric for problems with
geometry. It can be simply explained as the ordinary straight-line distance
between two points, and it is one of the most used measures in cluster
analysis. One of the algorithms that uses this formula is K-means.
Mathematically, it computes the square root of the sum of squared differences
between the coordinates of two objects.

If the distance between two entities is low, their similarity is high, and vice versa.

Similarity can be taken as 1 divided by distance.


Most of the time we find that Euclidean distance is a good way to measure similarity.
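
A minimal Python sketch of this (not from the session; the customer values are made up):

```python
import math

# Made-up customer attribute vectors, e.g. [age, income].
c1 = [35, 60000]
c2 = [42, 52000]

# Square root of the sum of squared coordinate differences.
distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(c1, c2)))

# Per the note above, similarity is taken as 1 divided by distance.
similarity = 1 / distance if distance > 0 else float("inf")
print(distance, similarity)
```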

Categorical / Nominal Attributes: how do we find the similarity between
categorical attributes? We cannot use Euclidean distance here. To find the
similarity, divide the number of matching attributes by the total number of
attributes.

Jaccard Coefficient: it is the ratio of the common attributes to the total attributes. The Jaccard
coefficient measures the similarity of two data items as the intersection of those items divided by
their union. The Jaccard coefficient is useful when we have a large number of attributes.
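
A quick Python sketch, using made-up item sets:

```python
# Two made-up sets of attributes/items.
basket1 = {"laptop", "mouse", "pen drive"}
basket2 = {"laptop", "pen drive", "headphones"}

# Jaccard coefficient: |intersection| / |union|.
jaccard = len(basket1 & basket2) / len(basket1 | basket2)
print(jaccard)  # 2 / 4 = 0.5
```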

Manhattan Distance:
This determines the absolute differences between the pairs of coordinates.
Suppose we have two points P and Q; to determine the distance between
these points we simply calculate the distances of the points along the
X-axis and Y-axis and add them up.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2):
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
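
A tiny Python sketch with made-up coordinates:

```python
# Made-up points P(x1, y1) and Q(x2, y2).
p = (1, 5)
q = (4, 1)

# |x1 - x2| + |y1 - y2|, as in the formula above.
manhattan = abs(p[0] - q[0]) + abs(p[1] - q[1])
print(manhattan)  # 3 + 4 = 7
```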

Cosine Index:
The cosine distance measure for clustering uses the cosine of the angle
between two vectors. It is used when objects are represented as vectors.
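
A small Python sketch with made-up vectors:

```python
import math

# Two made-up attribute vectors.
v1 = [3.0, 4.0, 0.0]
v2 = [4.0, 3.0, 2.0]

dot = sum(a * b for a, b in zip(v1, v2))
norm1 = math.sqrt(sum(a * a for a in v1))
norm2 = math.sqrt(sum(b * b for b in v2))

cosine_similarity = dot / (norm1 * norm2)  # cosine of the angle between v1 and v2
cosine_distance = 1 - cosine_similarity
print(cosine_similarity, cosine_distance)
```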
Suppose we have two customers c1 and c2 which have both qualitative and
quantitative attributes. We need to calculate both the Euclidean distance and
the Jaccard coefficient. We then rescale the Euclidean distance to between 0
and 1, since the Jaccard coefficient is always between 0 and 1.
Ordinal attributes: scale them between 0 and 1. Example: income levels low,
medium, high, very high.
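
A rough Python sketch of both rescalings (the distances and income levels are made up):

```python
# Made-up Euclidean distances, min-max rescaled to [0, 1]
# so they are comparable with the Jaccard coefficient.
distances = [120.0, 340.0, 80.0, 560.0]
d_min, d_max = min(distances), max(distances)
scaled = [(d - d_min) / (d_max - d_min) for d in distances]
print(scaled)

# Ordinal attribute: map the ordered levels to evenly spaced values in [0, 1].
levels = ["low", "medium", "high", "very high"]
level_scale = {lvl: i / (len(levels) - 1) for i, lvl in enumerate(levels)}
print(level_scale)  # low=0.0, medium=0.33..., high=0.66..., very high=1.0
```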
Segmentation: below are the steps.
1) Define the problem: on what attributes you want to create the clusters.
2) Set the objective for which the clusters are created, and choose the distance/similarity measure.
3) Choose the clustering algorithm, for example the K-means algorithm.
4) Define how many groups need to be created.
5) Create, interpret and profile the clusters.

Clustering deals with the process of creating groups, and segmentation is the
process of connecting the clusters to business activities.

K-means algorithm: works only on quantitative attributes.

Clustering is unsupervised learning.


Case study: Geico insurance
Step 1

K-means clustering: by default, K-means uses Euclidean distance.


Steps in K-means:
1) Decide the number of clusters you want to create.
2) Randomly choose that many cluster centres from among the data points.
3) Compute the distance of all other points and assign each point to the nearest
cluster.
4) Compute the new centre of each cluster by taking the average along each
dimension.

Steps 3 and 4 are repeated until the centres no longer change; no change
in the centre points is the stopping criterion.
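
A rough sketch of these steps using scikit-learn's KMeans; the data and n_clusters=3 are assumptions, not from the session:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up customer data, two quantitative attributes per row.
X = np.array([[25, 40], [27, 45], [60, 15], [62, 18], [40, 90], [42, 88]])

# KMeans repeats the assign/recompute steps internally until the
# centres stop changing (the stopping criterion above).
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)
print(kmeans.labels_)           # nearest cluster assigned to each point
print(kmeans.cluster_centers_)  # centre = average of each dimension
```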

How to find how many groups to create: as we keep increasing the number of
clusters, the within-cluster variance will reduce.

We plot a graph with the K value on the x-axis and the within-cluster
variance on the y-axis. We infer that after a certain point, adding more
clusters no longer gives a sharp decline in variance; that point is the
correct number of clusters. This method is known as the elbow method.
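
A rough sketch of the elbow plot with scikit-learn and matplotlib (the data is made up):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[25, 40], [27, 45], [60, 15], [62, 18], [40, 90], [42, 88]])

inertias = []
ks = range(1, 6)
for k in ks:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)  # within-cluster variance

plt.plot(list(ks), inertias, marker="o")
plt.xlabel("K (number of clusters)")
plt.ylabel("Within-cluster variance (inertia)")
plt.show()  # look for the "elbow" where the decline flattens
```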
Silhouette score: this is another way to find the right number of
groups. For each point it compares the mean distance to the nearest
neighbouring cluster (b) with the mean distance within its own cluster (a),
roughly (b − a) / max(a, b). If this score is high, that is the right value of K.
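
A rough sketch using scikit-learn's silhouette_score (same kind of made-up data):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[25, 40], [27, 45], [60, 15], [62, 18], [40, 90], [42, 88]])

for k in range(2, 5):  # silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))  # pick K with the highest score
```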

How to profile the clusters:
1) demographics
2) sustainability: size of the segment, competitors
3) accessibility: mechanism to reach out to the customers
4) actionability
Go back to the video at 2:25 hrs to see how to do this in Python.

Session 3:

Hierarchical clustering: works only on quantitative attributes, like K-means.

Below are its types:

1) Agglomerative clustering:

Treats each data point as one cluster. It then finds the similarity between two clusters based on the
least distance or highest similarity, and based on that similarity it merges them, reducing the number
of clusters. For clubbing clusters we can use four criteria: average, max, min, and Ward's method.

Ward's method: it checks for the reduction in variance among the clusters being clubbed. It gives
good performance.

If the data is large, it does not provide good results, but for smaller data it helps to identify each
level of clubbing.

For the code in Python, watch from 17:00 min in the video.
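
A rough offline sketch with scikit-learn's AgglomerativeClustering (the data and n_clusters are made up; this may differ from the video's code):

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

X = np.array([[25, 40], [27, 45], [60, 15], [62, 18], [40, 90], [42, 88]])

# linkage="ward" merges the pair of clusters that increases total
# within-cluster variance the least; "average", "complete" (max) and
# "single" (min) are the other clubbing criteria from the notes.
agg = AgglomerativeClustering(n_clusters=3, linkage="ward")
print(agg.fit_predict(X))  # cluster label per point
```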


DBSCAN algorithm: it is the third type of clustering. It stands for density-based spatial clustering
of applications with noise. It identifies dense regions among the data points, so it creates the
clusters on its own, unlike K-means.

It takes 2 inputs

1) min pts (minimum number of points)

2) radius

It will check, for each point, whether the minimum number of points is present within the radius. If
yes, that is a dense region.

All other points which fall in a dense region keep expanding the dense region.

It also identifies the noise or outliers in the data: points which are not in any cluster are considered
noise.

For the Python code, check 36:00 in the video.
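
A rough offline sketch with scikit-learn's DBSCAN (eps, min_samples and the data are made up; this may differ from the video's code):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Made-up data; the last point is an isolated outlier.
X = np.array([[25, 40], [27, 45], [60, 15], [62, 18], [40, 90], [42, 88], [0, 0]])

# eps is the radius, min_samples is "min pts"; both values are assumptions.
db = DBSCAN(eps=10, min_samples=2).fit(X)
print(db.labels_)  # cluster index per point; -1 marks noise/outliers
```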


K-means and hierarchical clustering only create clusters where the data can be separated by linear
boundaries. When the data needs a circular (non-linear) boundary, such as some healthcare data, we
need DBSCAN. We should also use DBSCAN to find outliers or noise.
Segmentation of qualitative attributes:
1) Either create your own distance, like the Jaccard distance, and average it
2) K-Modes algorithm

For data with both qualitative and quantitative attributes we have another algorithm known as K-Prototypes.

Explained through code: see 54:00 min in the video for the code and explanation.
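
A rough sketch using the third-party kmodes package (an assumption; the session's code may use something else), with made-up mixed data:

```python
import numpy as np
from kmodes.kprototypes import KPrototypes  # pip install kmodes

# Made-up data: income (quantitative), city and plan (qualitative).
X_mixed = np.array([[40000, "Pune", "prepaid"],
                    [42000, "Pune", "prepaid"],
                    [90000, "Delhi", "postpaid"],
                    [88000, "Delhi", "postpaid"]], dtype=object)

# categorical=[1, 2] tells K-Prototypes which columns are qualitative;
# for purely qualitative data, kmodes.kmodes.KModes works the same way.
kproto = KPrototypes(n_clusters=2, random_state=42)
print(kproto.fit_predict(X_mixed, categorical=[1, 2]))
```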

Association Rule Mining / market basket analysis: so far, we extracted the natural groups
in the data. ARM means getting knowledge or information from transaction data. It tries to
extract associations between the data, i.e. relationships between any two SKUs. Example: if a
customer buys a laptop, they buy a pen drive also; the laptop is referred to as the head and the pen
drive as the body. This is very useful for product planning, customer recommendations, etc.
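
A rough sketch of market basket analysis using the mlxtend package (an assumption, not necessarily the course's tool), with made-up transactions:

```python
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

# Made-up transactions, one basket of SKUs per customer visit.
transactions = [["laptop", "pen drive"],
                ["laptop", "pen drive", "mouse"],
                ["laptop", "mouse"],
                ["pen drive"]]

# One-hot encode the baskets into a True/False item matrix.
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(transactions).transform(transactions),
                      columns=te.columns_)

# Frequent itemsets, then rules like "laptop -> pen drive".
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.6)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```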
