Machine Learning
Session 1:
Session 2:
Segmentation and Clustering. Segmentation is the process of putting customers
into groups based on similarities, and clustering is the process of finding
similarities among customers so that they can be grouped, and therefore
segmented. Two different groups should, however, be dissimilar to each other.
Companies might try to identify similarities based on demographic,
psychographic, or product-expectation attributes.
Jaccard Coefficient: it is the ratio of the common attributes to the total
attributes, i.e. the size of the intersection of the two items' attribute sets
divided by the size of their union. The Jaccard distance is 1 minus this
coefficient. The Jaccard coefficient is useful when we have a large number of
attributes.
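The coefficient above can be sketched in a few lines of Python; the two customer attribute sets here are hypothetical examples, not from the lecture.

```python
def jaccard_similarity(a, b):
    """Jaccard coefficient: |A intersect B| / |A union B| for two attribute sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def jaccard_distance(a, b):
    """Jaccard distance is 1 minus the coefficient."""
    return 1.0 - jaccard_similarity(a, b)

# Two customers described by hypothetical purchased-product attributes
c1 = {"laptop", "mouse", "pen_drive"}
c2 = {"laptop", "pen_drive", "headset", "webcam"}
print(jaccard_similarity(c1, c2))  # 2 common / 5 total = 0.4
```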
Manhattan Distance:
This is the sum of the absolute differences between the pairs of coordinates.
Suppose we have two points P and Q; to determine the distance between
them, we add the distances between the points along the
X-axis and the Y-axis.
In a plane with P at coordinate (x1, y1) and Q at (x2, y2).
Manhattan distance between P and Q = |x1 – x2| + |y1 – y2|
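The formula above maps directly to code; the points used here are made-up examples.

```python
def manhattan_distance(p, q):
    """Sum of absolute coordinate differences between points p and q."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

# P = (1, 2), Q = (4, 6): |1 - 4| + |2 - 6| = 3 + 4 = 7
print(manhattan_distance((1, 2), (4, 6)))  # 7
```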
Cosine Index:
The cosine distance measure for clustering determines the cosine of the angle
between two vectors, cos θ = (A · B) / (|A| |B|). It is used when objects are
represented as vectors.
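A minimal sketch of the cosine formula, using made-up vectors:

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| * |v|)"""
    dot = sum(ui * vi for ui, vi in zip(u, v))
    norm_u = math.sqrt(sum(ui * ui for ui in u))
    norm_v = math.sqrt(sum(vi * vi for vi in v))
    return dot / (norm_u * norm_v)

print(cosine_similarity([1, 0], [0, 1]))  # orthogonal vectors -> 0.0
print(cosine_similarity([2, 3], [4, 6]))  # parallel vectors -> close to 1.0
```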
Suppose we have two customers, c1 and c2, with both qualitative and
quantitative attributes. We need to calculate the Euclidean distance for the
quantitative attributes and the Jaccard distance for the qualitative ones.
Steps 3 and 4 are repeated until the centres no longer change; no change in
the centre points is the stopping criterion.
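The assign-and-update loop with its stopping criterion can be sketched as below; this is a minimal 1-D illustration with hypothetical data, not the lecture's code.

```python
# Minimal K-means sketch: repeat assignment (step 3) and centre update
# (step 4) until the centres stop changing.
def kmeans_1d(points, centres, max_iter=100):
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centre
        clusters = [[] for _ in centres]
        for p in points:
            nearest = min(range(len(centres)), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Step 4: recompute each centre as the mean of its cluster
        new_centres = [sum(c) / len(c) if c else centres[i]
                       for i, c in enumerate(clusters)]
        if new_centres == centres:  # no change of centres -> stop
            break
        centres = new_centres
    return centres

print(kmeans_1d([1.0, 2.0, 10.0, 11.0], [0.0, 5.0]))  # -> [1.5, 10.5]
```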
We plot a graph with the K value on the X-axis and the within-cluster
variance on the Y-axis. We infer that beyond a certain K, adding more
clusters no longer gives a sharp decline in variance. That point is the
correct number of clusters; this method is known as the elbow method.
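A sketch of the elbow method, assuming scikit-learn is available; the data is synthetic (`make_blobs`), and `inertia_` is scikit-learn's name for within-cluster variance.

```python
# Elbow method: fit K-means for several K and watch the within-cluster
# variance (inertia_) stop dropping sharply.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

inertias = {}
for k in range(1, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias[k] = km.inertia_

for k, wcv in inertias.items():
    print(k, round(wcv, 1))
# The decline flattens after K = 3 -- the elbow point for this data.
```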
Silhouette score: this is another way to find the right number of
groups. For each point it compares the mean distance to the nearest
neighbouring cluster (b) with the mean distance within its own cluster (a),
as (b − a) / max(a, b). The K with the highest score is the right value.
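A sketch of picking K by silhouette score, again assuming scikit-learn and synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# The silhouette score needs at least 2 clusters, so start K at 2
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(k, round(scores[k], 3))
# The highest score should appear at K = 3 for this 3-blob data.
```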
Session 3:
Treats each data point as one cluster, then finds the similarity between two clusters based on least
distance (highest similarity). Based on similarity it keeps merging clusters, reducing their number.
For merging clusters we can use four criteria: average, max, min, and Ward's method.
Ward's method: it merges the pair of clusters that least increases the variance across clusters; it
gives good performance.
If the data is large, this does not give good results, but for smaller data it lets us identify each
level of merging.
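The merging process above can be sketched with SciPy, assuming it is available; the four points are made up, and `"ward"`, `"single"`, `"complete"`, and `"average"` are SciPy's names for the Ward / min / max / average criteria.

```python
# Agglomerative clustering: every point starts as its own cluster and the
# closest pair of clusters is merged at each step.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 1.1], [5.0, 5.0], [5.1, 4.9]])

# method="ward" merges the pair that least increases within-cluster variance
Z = linkage(X, method="ward")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the two left points and the two right points pair up
```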
DBSCAN takes 2 inputs:
1) min pts
2) radius
It checks, for each point, whether the minimum number of points falls within the radius. If yes,
that is a dense region. All other points that fall in a dense region keep expanding the dense region.
It also identifies noise, or outliers, in the data: points that are not in any cluster are considered
noise.
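A sketch with scikit-learn's DBSCAN, assuming it is available; the points are hypothetical, and in scikit-learn the radius is `eps`, min pts is `min_samples`, and noise points get the label -1.

```python
import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([[1.0, 1.0], [1.1, 1.0], [0.9, 1.1],   # dense region 1
              [5.0, 5.0], [5.1, 5.1], [4.9, 5.0],   # dense region 2
              [9.0, 9.0]])                           # isolated outlier

# radius = 0.5, minimum points = 3 (the point itself counts)
db = DBSCAN(eps=0.5, min_samples=3).fit(X)
print(db.labels_)  # two dense clusters; -1 marks the noise point
```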
For data with both qualitative and quantitative attributes, we have another algorithm, known as
K-prototype. Explained through code; see the 54:00 mark for the code and explanation.
Association Rule Mining / market basket analysis: so far, we extracted the natural groups
in the data. ARM means getting knowledge or information from transaction data. It tries to
extract associations within the data, i.e. relationships between any 2 SKUs. Example: if a
customer buys a laptop, they also buy a pen drive; the laptop is referred to as the head and the
pen drive as the body. This is very useful for product planning, customer recommendations, etc.
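The laptop/pen drive rule can be quantified with the standard support and confidence measures; this is a minimal sketch over made-up transaction data, not the lecture's dataset.

```python
# Hypothetical transactions, each a set of SKUs bought together
transactions = [
    {"laptop", "pen_drive", "mouse"},
    {"laptop", "pen_drive"},
    {"laptop", "headset"},
    {"pen_drive"},
    {"mouse", "headset"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(head, body):
    """Of the transactions containing the head, the fraction that also contain the body."""
    return support(head | body) / support(head)

print(support({"laptop"}))                    # 3 of 5 transactions = 0.6
print(confidence({"laptop"}, {"pen_drive"}))  # 2 of the 3 laptop buyers, about 0.667
```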